We discovered an issue yesterday whereby our test cluster stopped mining. It had been running for about a week without issue beforehand. After a number of failed attempts to get the nodes mining again, we tore the cluster down and started it up fresh; after a short space of time the issue occurred again.
Our working theory at the minute is:
Our automated tests create new addresses/keys per run and were granting these new addresses mine permission. The only mining parameter we had changed in our setup was admin-consensus-mine, which we set to 0. Is it possible that all these additional permitted addresses/keys on the node were affecting the rules governing mining, specifically in relation to the mining-diversity setting?
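To make the theory concrete, here is a small sketch of how I understand the round-robin rule: after mining a block, an address must sit out roughly ceil(mining-diversity × permitted miners) blocks, so the chain needs at least that many distinct live miners to keep advancing. The rule as stated, and all the numbers below, are my assumptions, not confirmed behaviour.

```python
from math import ceil

def min_active_miners(permitted: int, diversity: float) -> int:
    """Smallest number of distinct live miners needed to keep the chain
    advancing, assuming a miner must skip the next
    ceil(diversity * permitted) - 1 blocks after mining one.
    (My reading of the mining-diversity rule; not verified.)"""
    return ceil(diversity * permitted)

# Hypothetical figures: 3 real nodes, plus 47 throwaway test addresses
# that were granted mine permission but never actually mine.
live_nodes = 3
test_addresses = 47
permitted = live_nodes + test_addresses  # 50 permitted miners in total

needed = min_active_miners(permitted, diversity=0.75)
print(needed)               # 38 distinct miners required per window
print(needed > live_nodes)  # True -> only 3 live miners, chain stalls
```

If that reading is right, every test run that granted mine permission to a fresh address quietly raised the bar until the live nodes could no longer satisfy it.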
I have now tuned mining-diversity down to zero and am keeping an eye on the newly created cluster.
A couple of questions:
Do you think this was the issue? If so, there's probably some guidance needed on properly configuring a cluster with respect to the mining parameters versus the number of nodes.
Secondly, how could we have recovered from this? In the end we tore the cluster down because we couldn't get it mining; all transactions were just sitting in the mempool. If this happened in a production environment, blowing the cluster away is not really an option :)
Assuming the issue was caused by these additional mining permissions, would revoking them have recovered the cluster?
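My intuition for why revocation should work, again assuming my reading of the mining-diversity rule is correct (the stall condition and all figures here are hypothetical):

```python
from math import ceil

def stalled(live: int, permitted: int, diversity: float) -> bool:
    """True if there are fewer live miners than the round-robin window
    ceil(diversity * permitted) requires.
    (My assumed stall condition; not confirmed behaviour.)"""
    return live < ceil(diversity * permitted)

live = 3  # hypothetical: three real mining nodes

# Before: 47 stale test addresses still hold mine permission.
print(stalled(live, permitted=50, diversity=0.75))  # True -> stuck

# After revoking the stale addresses, only the live nodes remain permitted.
print(stalled(live, permitted=3, diversity=0.75))   # False -> mining resumes
```

If that holds, revoking the stale permissions should have shrunk the window enough for the surviving nodes to resume mining, without discarding the chain or the mempool.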
If this is a misconfiguration issue rather than a bug, some sort of log message or warning from the nodes would have been really useful, given the amount of time we spent trying to resolve it.
If you think this shouldn't have been an issue, do you have any other suggestions as to what the problem might have been? I backed up the volume of one of the nodes and should be able to start it up again to investigate further.