High CPU usage when attempting to connect to unavailable nodes

+3 votes

I believe I have identified a bug with the peer reconnection code. I have a node that has done nothing for the last month. A month ago I generated a few blocks and connected to it from a couple other nodes, but then shut those other nodes down. The main node has been attempting to reconnect to those 2 nodes for the last month, and spinning more and more of the CPU as each day goes by until it's using about 25% of the instance's CPU now. I got an alert about low CPU credits this morning and found this to be the problem when I investigated.

When I run the node with `-debug` I see these lines repeated over and over again:

    net: Trying to connect to 123.123.123.123:6700
    net: trying connection 123.123.123.123:6700 lastseen=627.1hrs
    net: connect() to 123.123.123.123:6700 failed after select(): Connection refused (111)
    net: Connection not established
    net: Trying to connect to 192.168.1.49:6700
    net: trying connection 192.168.1.49:6700 lastseen=627.1hrs
    net: connection to 192.168.1.49:6700 timeout
    net: Connection not established

If I start the node in either `-offline` mode, or if I delete the `peers.dat` file before starting, it does not exhibit this behavior, and sits at 0% CPU like I expect for a node that only has 50 blocks, is not mining and has no actual peers.

This is with versions 2.0a5, 2.0a6, 2.0a7 and 2.0b1.

I also tested copying the .multichain directory to another instance on a newer version of Ubuntu (bionic instead of xenial), and a non-vm computer, and I get the same behavior.

This is a graph showing how it has escalated over the last month.

How long will it wait before it stops trying to connect to these dead nodes? And why is there no way to view what is in the peers.dat file, and to clean them up? I've searched around and not found any definitive answers to how this system works. (or is supposed to work.)

Thanks!

Nate

 

asked Dec 26, 2018 by Nate

2 Answers

+1 vote
Thanks for reporting this – we'll look into it. Can you please provide a bit more information from the log (including timestamps) so we can see the frequency of reconnection attempts?
answered Dec 28, 2018 by MultiChain
I ran multichaind again for an entire hour and pasted the logs at:

https://pastebin.com/jW9yDE0a

I replaced my client's actual IP with "123.123.123.123" in the logs.

During that time there were 405 connection attempts to 123.123.123.123, and 448 connection attempts to 192.168.1.49.

Also during that time I saw the multichaind process use up CPU in the same way as before, which can be seen here:

https://imgur.com/kcHalE9

Thank you very much for the help. Hopefully it turns out to be a simple fix.

Nate
Thanks. It seems unlikely that a connection attempt every few seconds would be responsible for such high CPU usage. Is it possible this node is performing proof-of-work mining? You can post the blockchain parameters here and we can take a look.
I agree that it's very odd, but I arrived at this conclusion due to the fact that if I delete the peers.dat file, it stops behaving this way. The other thing that seems to be indicative of the problem is that it got worse as time went on, so perhaps it's something to do with the lastseen time.

I have provided the params.dat file at https://pastebin.com/E3ySYPtR with sensitive information replaced (chain name, hashes, keys, etc). But all of the settings are untouched. I'm pretty sure that the only setting that is not default is mine-empty-rounds.

Also, there is no mining happening that I am aware of. While it exhibits the issue, no new blocks are added to the chain at all, and there's nothing in the logs. During that entire month there were maybe a dozen blocks added total, as this service is still mostly idle at the moment. Also, there are no RPC calls being made at all during my retest, so I don't think it's related to that either.

Thanks!
Nate
Thanks for the update. It seems you are right that this is not mining related. The team will take a look.
Just to update you that we haven't forgotten about this and will respond with more details when we have them.
No problem, thanks for looking into it. It's not a huge rush due to the workaround of deleting peers.dat, but I have to imagine it's affecting other people as well. Though they just may not be aware of it. =)
We've identified and fixed this problem, and the fix will appear in the next release of both MultiChain 1.0.x and 2.0 beta.
+1 vote
We've identified and fixed this problem, and the fix will appear in the next release of both MultiChain 1.0.x and 2.0 beta.
answered Jan 22, 2019 by MultiChain
Awesome, thank you very much.

Nate
Hi. Regarding this issue, how can I remove a peer that is no longer part of the system? I tried getpeerinfo but it won't show the addresses of disconnected nodes. I'm trying to resolve this issue without resorting to upgrade Multichain in a production environment.

Thanks.
The only way I know of is to shut down multichaind then delete peers.dat before starting it up again. Then you'll need to have all active peers reconnect to it to repopulate the list of peers. However, this is definitely an unsupported option, and use at your own risk. I would highly suggest backing up the peers.dat file first.
Note also that peers that cannot be reached will eventually be forgotten automatically.
Dear MultiChain team,

I've noticed this issue re-surfaced in v2.1.1.

Scenario: I have a private network of 3 nodes (in AWS) and they've been working fine for a number of months without any issues. Recently, as part of routine maintenance, I've restarted all 3 nodes one after another in sequence, and noticed unusual number of errors in logs as well as CPU spiking up in two of the three nodes (similar to the graphs that Nate pointed in the original post).

Upon inspecting various things, I concluded that the spike and errors are because Nodes 1 and 2 and trying to connect/establish a peer with Node 3 even though they are already connected and working fine together. Looks like there is a stale/unnecessary entry belonging to Node 3 inside the peers.dat file making Nodes 1 and 2 trying to unsuccessfully connect to this stale entry of Node 3.

I've tested a number of transactions, and can confirm all 3 nodes are up and running correctly (getpeerinfo showing peers correctly), all new blocks inserted on any of the three nodes are getting synced across all 3 nodes as normal. The only problems in this scenario are CPU spikes in Nodes 1 and 2, and lots of error messages in the logs.

Below is the sample log file with out debug options:
2022-03-14 10:51:26 socket send error Broken pipe (32)
2022-03-14 10:51:29 socket send error Broken pipe (32)
2022-03-14 10:51:33 socket send error Broken pipe (32)
2022-03-14 10:51:34 socket send error Broken pipe (32)
2022-03-14 10:52:29 socket send error Broken pipe (32)
2022-03-14 10:52:30 socket send error Broken pipe (32)
2022-03-14 10:52:31 socket send error Broken pipe (32)
2022-03-14 10:52:34 socket send error Broken pipe (32)
2022-03-14 10:52:43 socket send error Broken pipe (32)
2022-03-14 10:52:51 socket send error Broken pipe (32)
2022-03-14 10:52:57 socket send error Broken pipe (32)
2022-03-14 10:53:21 CommitTransaction: 44191d2aae3c5815f5ad4262b43e2f917f1fc043b8882152cb94cee189873f49, vin: 1, vout: 3
2022-03-14 10:53:25 socket send error Broken pipe (32)
2022-03-14 10:53:26 UpdateTip:            new best=5c8c943d2dc3c6ecc1e41c3876e9536557847cbb20033bf48065f33f9302c731  height=813  log2_work=10.668885  tx=1115  date=2022-03-14 10:53:26 progress=1.000000  cache=0
2022-03-14 10:53:28 UpdateTip:            new best=56709fe9dbddf4f2e95de9a875b32c84582a8baa228e41cbea8f7c589b58d60f  height=814  log2_work=10.670656  tx=1116  date=2022-03-14 10:53:28 progress=1.000000  cache=0
2022-03-14 10:53:29 ResendWalletTransactions()
2022-03-14 10:53:30 socket send error Broken pipe (32)
2022-03-14 10:53:31 UpdateTip:            new best=1854643643582dbfc6a4f054c9b8428f3e30c438b715cf328adbe2705c786adc  height=815  log2_work=10.672425  tx=1117  date=2022-03-14 10:53:31 progress=1.000000  cache=0
2022-03-14 10:53:33 ResendWalletTransactions()
2022-03-14 10:53:37 socket send error Broken pipe (32)
2022-03-14 10:53:43 socket send error Broken pipe (32)
2022-03-14 10:53:49 socket send error Broken pipe (32)
2022-03-14 10:53:59 socket send error Broken pipe (32)
2022-03-14 10:54:12 socket send error Broken pipe (32)
2022-03-14 10:54:20 socket send error Broken pipe (32)


Below is the sample log file with debug options:
2022-03-14 06:05:27 net: Trying to connect to [fe80::848:bdff:gd9a:32a0]:1234
2022-03-14 06:05:27 net: trying connection [fe80::848:bdff:gd9a:32a0]:1234 lastseen=3295.4hrs
2022-03-14 06:05:27 Added connection peer=362
2022-03-14 06:05:27 send version message: version 70002, blocks=803, us=[fe80::f9:f7ff:fe9b:e200]:1234, peer=362
2022-03-14 06:05:27 sending: version mchn: SEND: version
2022-03-14 06:05:27 (115 bytes) peer=362
2022-03-14 06:05:27 socket send error Broken pipe (32)
2022-03-14 06:05:27 disconnecting peer=362
2022-03-14 06:05:27 net: Connection to [fe80::848:bdff:gd9a:32a0]:1234 established
2022-03-14 06:05:27 disconnect flag set
2022-03-14 06:05:29 net: Trying to connect to [fe80::848:bdff:gd9a:32a0]:1234
2022-03-14 06:05:29 net: trying connection [fe80::848:bdff:gd9a:32a0]:1234 lastseen=3295.4hrs
2022-03-14 06:05:29 Added connection peer=363
2022-03-14 06:05:29 send version message: version 70002, blocks=803, us=[fe80::f9:f7ff:fe9b:e200]:1234, peer=363
2022-03-14 06:05:29 sending: version mchn: SEND: version
2022-03-14 06:05:29 (115 bytes) peer=363
2022-03-14 06:05:29 socket send error Broken pipe (32)
2022-03-14 06:05:29 disconnecting peer=363
2022-03-14 06:05:29 net: Connection to [fe80::848:bdff:gd9a:32a0]:1234 established
2022-03-14 06:05:29 disconnect flag set
2022-03-14 06:05:30 net: Trying to connect to [fe80::848:bdff:gd9a:32a0]:1234
2022-03-14 06:05:30 net: trying connection [fe80::848:bdff:gd9a:32a0]:1234 lastseen=3295.4hrs
2022-03-14 06:05:30 Added connection peer=364
2022-03-14 06:05:30 send version message: version 70002, blocks=803, us=[fe80::f9:f7ff:fe9b:e200]:1234, peer=364
2022-03-14 06:05:30 sending: version mchn: SEND: version
2022-03-14 06:05:30 (115 bytes) peer=364
2022-03-14 06:05:30 socket send error Broken pipe (32)
2022-03-14 06:05:30 disconnecting peer=364
2022-03-14 06:05:30 net: Connection to [fe80::848:bdff:gd9a:32a0]:1234 established
2022-03-14 06:05:30 disconnect flag set


fe80::848:bdff:gd9a:32a0 belongs to Node 3 (it’s in localaddresses under getnetworkinfo) btw.

I’ve looked through all the documentation and Q&As and couldn’t find a solution for how to rectify this situation? This is the first time such an issue happened in more than an year of usage of the same setup with similar type of maintenance (restart of the machines and nodes). Could you please help me answering the below:

1) How can I get rid of the unwanted entry inside the peers.dat file? I’ve tried deleting peers.dat in Nodes 1 and 2, but a copy was sent across automatically from Node 3 repeating the same issue. If I delete peers.dat in all nodes, then I believe these nodes won’t connect anymore automatically.

2) Can I delete all peers.dat files in all nodes, then revoke the permissions (connect etc) granted as part of “Connecting to a blockchain” procedure, then re-connect the nodes freshly using “multichaind chain1@12.34.56.78:8571 -daemon” process and re-grant the permissions? If I do this, would the old data inside the blockchain gets corrupted or would the nodes be able to re-sync on existing addresses of the nodes and be back to normality fully?

3) How long would the nodes wait before dropping stale connections automatically? From the debug log above looks like lastseen=3295.4hrs but still attempts are continuing to be made to connect spiking up the CPU? It would be really helpful to have a disconnect/reconnect method for nodes added through the process of “multichaind chain1@12.34.56.78:8571 -daemon” rather than via addnode API. FYI - the getaddednodeinfo API is not showing any nodes even when all 3 nodes are connected.

Any help in this matter is much appreciated. Thank you in advance.
Can I recommend that you try 2.2.1 beta 1? It has a totally rewritten peer connection management algorithm and may solve this problem.
Thank you for the response. Sure, will give 2.2.1 a try.

Out of curiosity, if the same issue happens in 2.2.1 or in the future, is there a mechanism/procedure to get rid of the stale node in the peers.dat file, or is the only way is to wait until multichain automatically recognises and removes this connection. If it's the latter, what's the expected wait time before this happens? Please let me know.
Somewhat counterintuitively, the addnode command can also be used to remove a node, please see the docs: https://www.multichain.com/developers/json-rpc-api/
Thank you again for the suggestion on how to remove a node, appreciated.

Would this work on a node that's not added via the addnode command, but rather added via “multichaind chain1@12.34.56.78:8571 -daemon” process and granted the permissions (per instructions @ https://www.multichain.com/developers/creating-connecting/)?

I can see my nodes connected through getpeerinfo, but when I tried getaddednodeinfo command, nothing is showing up in any of the nodes. So I assumed, the remove a node through addnode commands works only for the nodes originally added via the addnode command only?

Could you kindly clarify.
Sorry, I think you're right, this only affects the addnode list. I'll ask the dev team for an answer to your earlier question.
So I got an answer. There isn't currently a way to explicitly forget a node (which wasn't added via addnode) in either version. But the connection manager in 2.2.1 should prevent CPU spikes due to connection attempts.
Thank you very much for getting back to me, much appreciated. I'll upgrade multichain to 2.2.1 and give it a try.

If you don't mind, couple of questions please.

1) What happens if I revoke all the permissions (connect etc) granted originally as part of “Connecting to a blockchain” procedure, delete all peers.dat files in all nodes, then re-connect the same nodes again using “multichaind chain1@12.34.56.78:8571 -daemon” process and re-grant the permissions? If I do this, would the old data inside the blockchain gets corrupted or would the nodes be able to re-sync on existing addresses of the nodes and be back to normality fully?

2) In one of the earlier comments, you've mentioned about peers that cannot be reached will eventually be forgotten automatically. How long would the nodes wait before forgetting stale connections automatically? Or is this feature redundant from 2.2.1 onwards?

Thanks in advance!
Before I answer these questions, there are two facts which will make the situation clearer:

1. Permissions (including connect) are stored on blockchain and can be added/removed only via grant/revoke APIs. They don't depend on the connection state of specific node and all nodes (which are in-sync) share the same picture.
2. Network addresses (IP:port) are stored in peers.dat. From 2.2.1 onwards, additional information is stored in addrs.dat. This information is per-node and may be very different on different nodes.

Now, to your questions

1) Permissions are unchanged and blockchain data is not corrupted. But if you remove peers.dat (and addrs.dat), the nodes will not be able to find each other. So, you have to add nodes using “multichaind chain1@12.34.56.78:8571" or -addnode runtime parameter or addnode API. But you don't have to re-grant permissions.

2) Nodes are never forgotten completely even before 2.2.1 - they are just re-tried less frequently (and potentially with higher CPU usage in some rare situations). From 2.2.1 onwards, they are never forgotten and they are retried regularly (though less frequently) if the node is not connected properly, but this should never result in CPU usage spike.

We consider developing of some mechanism for removing old addresses in the future versions.
Hi Michael,

Firstly, thank you for taking the time to provide additional clarity and details for my questions, much appreciated.

In the last few days, I took backup of my nodes and experimented with a few methods to see if there is a mechanism to remove such stale nodes in a scenario like mine. The below steps worked for me perfectly; listing them down in case anyone else face the same situation in future and might need a similar solution.

1) Check whether all nodes are in-sync, then stop all 3 nodes and take a full backup of the multichain directory
2) Ensure there are no connections to the blockchain either sending or querying data
3) Node 1 - delete peers.dat (I’m on a slightly older version btw) file and restart the node, this would create a new/blank peers.dat
4) Node 2 - delete peers.dat file and start the node using “multichaind chain1@12.34.56.78:8571" method to establish connectivity with node 1. Both nodes are connected now, there wasn’t any message to grant permissions because the core address of node 2 was unchanged and it was already granted all necessary permissions (including connect) before this issue popped up. Stop Node 2 and restart it in normal mode i.e. without the chain1@12.34.56.78:8571 in the command
5) Node 3 - same steps as node 2
6) Shutdown all 3 nodes and restart all of them in normal mode and they all connected automatically with fresh peers.dat without stale connection details as before this whole issue started. No more errors in the logs, no more CPU spikes, all data is intact; I was able to retrieve old data and insert new data smoothly, everything is back to normal

Pretty much all of the above steps are in-sync with what you mentioned as response to my question #1. Thank you again for your help.
Great, thanks for providing so much detail.
...