High CPU usage when attempting to connect to unavailable nodes

+5 votes

I believe I have identified a bug with the peer reconnection code. I have a node that has done nothing for the last month. A month ago I generated a few blocks and connected to it from a couple other nodes, but then shut those other nodes down. The main node has been attempting to reconnect to those 2 nodes for the last month, and spinning more and more of the CPU as each day goes by until it's using about 25% of the instance's CPU now. I got an alert about low CPU credits this morning and found this to be the problem when I investigated.

When I run the node with `-debug` I see these lines repeated over and over again:

    net: Trying to connect to 123.123.123.123:6700
    net: trying connection 123.123.123.123:6700 lastseen=627.1hrs
    net: connect() to 123.123.123.123:6700 failed after select(): Connection refused (111)
    net: Connection not established
    net: Trying to connect to 192.168.1.49:6700
    net: trying connection 192.168.1.49:6700 lastseen=627.1hrs
    net: connection to 192.168.1.49:6700 timeout
    net: Connection not established

If I start the node in either `-offline` mode, or if I delete the `peers.dat` file before starting, it does not exhibit this behavior, and sits at 0% CPU like I expect for a node that only has 50 blocks, is not mining and has no actual peers.

This is with versions 2.0a5, 2.0a6, 2.0a7 and 2.0b1.

I also tested copying the .multichain directory to another instance on a newer version of Ubuntu (bionic instead of xenial), and a non-vm computer, and I get the same behavior.

This is a graph showing how it has escalated over the last month.

How long will it wait before it stops trying to connect to these dead nodes? And why is there no way to view what is in the peers.dat file, and to clean them up? I've searched around and not found any definitive answers to how this system works. (or is supposed to work.)

Thanks!

Nate

 

asked Dec 26, 2018 by Nate

2 Answers

+1 vote
Thanks for reporting this – we'll look into it. Can you please provide a bit more information from the log (including timestamps) so we can see the frequency of reconnection attempts?
answered Dec 28, 2018 by MultiChain
I ran multichaind again for an entire hour and pasted the logs at:

https://pastebin.com/jW9yDE0a

I replaced my client's actual IP with "123.123.123.123" in the logs.

During that time there were 405 connection attempts to 123.123.123.123, and 448 connection attempts to 192.168.1.49.

Also during that time I saw the multichaind process use up CPU in the same way as before, which can be seen here:

https://imgur.com/kcHalE9

Thank you very much for the help. Hopefully it turns out to be a simple fix.

Nate
Thanks. It seems unlikely that a connection attempt every few seconds would be responsible for such high CPU usage. Is it possible this node is performing proof-of-work mining? You can post the blockchain parameters here and we can take a look.
I agree that it's very odd, but I arrived at this conclusion due to the fact that if I delete the peers.dat file, it stops behaving this way. The other thing that seems to be indicative of the problem is that it got worse as time went on, so perhaps it's something to do with the lastseen time.

I have provided the params.dat file at https://pastebin.com/E3ySYPtR with sensitive information replaced (chain name, hashes, keys, etc). But all of the settings are untouched. I'm pretty sure that the only setting that is not default is mine-empty-rounds.

Also, there is no mining happening that I am aware of. While it exhibits the issue, no new blocks are added to the chain at all, and there's nothing in the logs. During that entire month there were maybe a dozen blocks added total, as this service is still mostly idle at the moment. Also, there are no RPC calls being made at all during my retest, so I don't think it's related to that either.

Thanks!
Nate
Thanks for the update. It seems you are right that this is not mining related. The team will take a look.
Just to update you that we haven't forgotten about this and will respond with more details when we have them.
No problem, thanks for looking into it. It's not a huge rush due to the workaround of deleting peers.dat, but I have to imagine it's affecting other people as well. Though they just may not be aware of it. =)
We've identified and fixed this problem, and the fix will appear in the next release of both MultiChain 1.0.x and 2.0 beta.
+1 vote
We've identified and fixed this problem, and the fix will appear in the next release of both MultiChain 1.0.x and 2.0 beta.
answered Jan 22, 2019 by MultiChain
So I got an answer. There isn't currently a way to explicitly forget a node (which wasn't added via addnode) in either version. But the connection manager in 2.2.1 should prevent CPU spikes due to connection attempts.
Thank you very much for getting back to me, much appreciated. I'll upgrade multichain to 2.2.1 and give it a try.

If you don't mind, couple of questions please.

1) What happens if I revoke all the permissions (connect etc) granted originally as part of “Connecting to a blockchain” procedure, delete all peers.dat files in all nodes, then re-connect the same nodes again using “multichaind chain1@12.34.56.78:8571 -daemon” process and re-grant the permissions? If I do this, would the old data inside the blockchain gets corrupted or would the nodes be able to re-sync on existing addresses of the nodes and be back to normality fully?

2) In one of the earlier comments, you've mentioned about peers that cannot be reached will eventually be forgotten automatically. How long would the nodes wait before forgetting stale connections automatically? Or is this feature redundant from 2.2.1 onwards?

Thanks in advance!
Before I answer these questions, there are two facts which will make the situation clearer:

1. Permissions (including connect) are stored on blockchain and can be added/removed only via grant/revoke APIs. They don't depend on the connection state of specific node and all nodes (which are in-sync) share the same picture.
2. Network addresses (IP:port) are stored in peers.dat. From 2.2.1 onwards, additional information is stored in addrs.dat. This information is per-node and may be very different on different nodes.

Now, to your questions

1) Permissions are unchanged and blockchain data is not corrupted. But if you remove peers.dat (and addrs.dat), the nodes will not be able to find each other. So, you have to add nodes using “multichaind chain1@12.34.56.78:8571" or -addnode runtime parameter or addnode API. But you don't have to re-grant permissions.

2) Nodes are never forgotten completely even before 2.2.1 - they are just re-tried less frequently (and potentially with higher CPU usage in some rare situations). From 2.2.1 onwards, they are never forgotten and they are retried regularly (though less frequently) if the node is not connected properly, but this should never result in CPU usage spike.

We consider developing of some mechanism for removing old addresses in the future versions.
Hi Michael,

Firstly, thank you for taking the time to provide additional clarity and details for my questions, much appreciated.

In the last few days, I took backup of my nodes and experimented with a few methods to see if there is a mechanism to remove such stale nodes in a scenario like mine. The below steps worked for me perfectly; listing them down in case anyone else face the same situation in future and might need a similar solution.

1) Check whether all nodes are in-sync, then stop all 3 nodes and take a full backup of the multichain directory
2) Ensure there are no connections to the blockchain either sending or querying data
3) Node 1 - delete peers.dat (I’m on a slightly older version btw) file and restart the node, this would create a new/blank peers.dat
4) Node 2 - delete peers.dat file and start the node using “multichaind chain1@12.34.56.78:8571" method to establish connectivity with node 1. Both nodes are connected now, there wasn’t any message to grant permissions because the core address of node 2 was unchanged and it was already granted all necessary permissions (including connect) before this issue popped up. Stop Node 2 and restart it in normal mode i.e. without the chain1@12.34.56.78:8571 in the command
5) Node 3 - same steps as node 2
6) Shutdown all 3 nodes and restart all of them in normal mode and they all connected automatically with fresh peers.dat without stale connection details as before this whole issue started. No more errors in the logs, no more CPU spikes, all data is intact; I was able to retrieve old data and insert new data smoothly, everything is back to normal

Pretty much all of the above steps are in-sync with what you mentioned as response to my question #1. Thank you again for your help.
Great, thanks for providing so much detail.
...