High CPU usage when attempting to connect to unavailable nodes

+3 votes

I believe I have identified a bug with the peer reconnection code. I have a node that has done nothing for the last month. A month ago I generated a few blocks and connected to it from a couple other nodes, but then shut those other nodes down. The main node has been attempting to reconnect to those 2 nodes for the last month, and spinning more and more of the CPU as each day goes by until it's using about 25% of the instance's CPU now. I got an alert about low CPU credits this morning and found this to be the problem when I investigated.

When I run the node with `-debug` I see these lines repeated over and over again:

    net: Trying to connect to 123.123.123.123:6700
    net: trying connection 123.123.123.123:6700 lastseen=627.1hrs
    net: connect() to 123.123.123.123:6700 failed after select(): Connection refused (111)
    net: Connection not established
    net: Trying to connect to 192.168.1.49:6700
    net: trying connection 192.168.1.49:6700 lastseen=627.1hrs
    net: connection to 192.168.1.49:6700 timeout
    net: Connection not established

If I start the node in either `-offline` mode, or if I delete the `peers.dat` file before starting, it does not exhibit this behavior, and sits at 0% CPU like I expect for a node that only has 50 blocks, is not mining and has no actual peers.

This is with versions 2.0a5, 2.0a6, 2.0a7 and 2.0b1.

I also tested copying the .multichain directory to another instance on a newer version of Ubuntu (bionic instead of xenial), and a non-vm computer, and I get the same behavior.

This is a graph showing how it has escalated over the last month.

How long will it wait before it stops trying to connect to these dead nodes? And why is there no way to view what is in the peers.dat file, and to clean them up? I've searched around and not found any definitive answers to how this system works. (or is supposed to work.)

Thanks!

Nate

 

asked Dec 26, 2018 by Nate

2 Answers

+1 vote
Thanks for reporting this – we'll look into it. Can you please provide a bit more information from the log (including timestamps) so we can see the frequency of reconnection attempts?
answered Dec 28, 2018 by MultiChain
I ran multichaind again for an entire hour and pasted the logs at:

https://pastebin.com/jW9yDE0a

I replaced my client's actual IP with "123.123.123.123" in the logs.

During that time there were 405 connection attempts to 123.123.123.123, and 448 connection attempts to 192.168.1.49.

Also during that time I saw the multichaind process use up CPU in the same way as before, which can be seen here:

https://imgur.com/kcHalE9

Thank you very much for the help. Hopefully it turns out to be a simple fix.

Nate
Thanks. It seems unlikely that a connection attempt every few seconds would be responsible for such high CPU usage. Is it possible this node is performing proof-of-work mining? You can post the blockchain parameters here and we can take a look.
I agree that it's very odd, but I arrived at this conclusion due to the fact that if I delete the peers.dat file, it stops behaving this way. The other thing that seems to be indicative of the problem is that it got worse as time went on, so perhaps it's something to do with the lastseen time.

I have provided the params.dat file at https://pastebin.com/E3ySYPtR with sensitive information replaced (chain name, hashes, keys, etc). But all of the settings are untouched. I'm pretty sure that the only setting that is not default is mine-empty-rounds.

Also, there is no mining happening that I am aware of. While it exhibits the issue, no new blocks are added to the chain at all, and there's nothing in the logs. During that entire month there were maybe a dozen blocks added total, as this service is still mostly idle at the moment. Also, there are no RPC calls being made at all during my retest, so I don't think it's related to that either.

Thanks!
Nate
Thanks for the update. It seems you are right that this is not mining related. The team will take a look.
Just to update you that we haven't forgotten about this and will respond with more details when we have them.
No problem, thanks for looking into it. It's not a huge rush due to the workaround of deleting peers.dat, but I have to imagine it's affecting other people as well. Though they just may not be aware of it. =)
We've identified and fixed this problem, and the fix will appear in the next release of both MultiChain 1.0.x and 2.0 beta.
+1 vote
We've identified and fixed this problem, and the fix will appear in the next release of both MultiChain 1.0.x and 2.0 beta.
answered Jan 22 by MultiChain
Awesome, thank you very much.

Nate
Hi. Regarding this issue, how can I remove a peer that is no longer part of the system? I tried getpeerinfo but it won't show the addresses of disconnected nodes. I'm trying to resolve this issue without resorting to upgrade Multichain in a production environment.

Thanks.
The only way I know of is to shut down multichaind then delete peers.dat before starting it up again. Then you'll need to have all active peers reconnect to it to repopulate the list of peers. However, this is definitely an unsupported option, and use at your own risk. I would highly suggest backing up the peers.dat file first.
Note also that peers that cannot be reached will eventually be forgotten automatically.
...