Diagnose Hanging Issue

+1 vote
Hi

We have an issue where one of our MultiChain nodes hangs every couple of days, and we were looking for some advice on how to diagnose what the cause might be.

Our setup is MultiChain running inside Docker on an Ubuntu VM on Azure. The RPC interface seems to just hang, with no error message in debug.log.

When this happens, we cannot access the node at all over RPC or via multichain-cli. The CLI can connect, but even a getinfo command just hangs.
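A minimal sketch of the kind of check that reproduces the symptom (the container name multichain-node, the chain name chain1 and the 10-second limit are placeholders for illustration, not our actual values):

    # fails fast if the daemon's API has stopped responding
    timeout 10 docker exec multichain-node multichain-cli chain1 getinfo \
      || echo "getinfo did not return within 10 seconds"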

 

Thx

Marty
asked Sep 4, 2017 by marty

1 Answer

0 votes

Thanks for reporting this. To help us get started on this, can you please confirm:

  • Which version of MultiChain this node is running (and are all nodes running the same version)?
  • Can you confirm this is a permanent hang, and not just the API being unresponsive for a few seconds (say, for example, because it's processing a large block and running on a slow server)?
  • Does this node have some special role in the network compared to the other nodes, e.g. tracking an unusually large number of addresses or streams, or the only miner, or receiving many more API requests, etc?
  • Is this node hosted in the same environment as the other nodes?
  • What is the host's memory capacity?
  • When the node's API hangs, do you still see the sent/received byte counts increasing for that node, when it is viewed in the output of getpeerinfo on other connected nodes?
  • When the node hangs, if you open top on its host server, what is its CPU and memory consumption? (A sketch of both checks follows below.)
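A minimal sketch of both checks, assuming a chain called chain1 and that the commands are run where multichaind is directly accessible (adjust for your Docker setup):

    # on another connected node: do bytessent / bytesrecv for the hung peer keep increasing?
    watch -n 5 'multichain-cli chain1 getpeerinfo | grep -E "addr|bytessent|bytesrecv"'

    # on the hung node's host: CPU and memory use of the daemon process
    top -p $(pidof multichaind)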
answered Sep 5, 2017 by MultiChain
Thanks for the response.  Answers below:

- We are running the 1.0 release version
- It was permanent as we spent a bit of time trying to work out what was wrong
- It was the only node (of 3) that was handling requests, which would have included a mix of stream writes and asset issuance / transfer
- The three nodes are within the same Azure resource group on identical builds
- 3.5 GB Memory
- I don't have any information on the sent/received counts, but I did connect a new node on my laptop to the hung node and it synced, just to check whether the network protocol was still functioning
- I don't know the memory/CPU utilisation when it hung; I'll check that if/when it happens again.
- Just to update on this: when it happened this morning, CPU was below 1%, but we did see a spike in disk utilisation.
- In addition, it looked like when the problem started, if you had an existing connection you could still access the node, but new connections were refused. Eventually existing connections were blocked as well.

Cheers

Marty
Thanks – but I'm confused by "if you had an existing connection you could still access the node, but new connections were refused" – because each MultiChain API call (whether through multichain-cli or another means) is a separate JSON-RPC HTTP request, and there is no persistence of the connections.
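To illustrate, each call (whether made through multichain-cli or a client library) boils down to a standalone HTTP POST carrying a JSON-RPC body, along these lines; the credentials and port below are placeholders, with the real values coming from the chain's multichain.conf (rpcuser/rpcpassword) and params.dat (default-rpc-port):

    curl --user multichainrpc:YOUR_RPC_PASSWORD \
         --data '{"method":"getinfo","params":[],"id":1}' \
         http://127.0.0.1:YOUR_RPC_PORT/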

In any event, my first guess would be that the RPC threads are getting tied up, perhaps by large incoming requests that are somehow not completing. Could you please try stopping MultiChain and running it with -rpcthreads=16 (instead of the default of 4) to see if this causes the problem to take longer to appear? If so, that would confirm my theory and we'll know where to focus (whether on a problem inside MultiChain or in the code calling its API).
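In practice that restart would look something like the following (a sketch only; chain1 is a placeholder for your chain name, and you would wrap these in however you launch the daemon inside Docker):

    multichain-cli chain1 stop                  # shut the node down cleanly
    multichaind chain1 -daemon -rpcthreads=16   # restart with 16 RPC threads instead of the default 4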
Thanks

On the existing vs new connections comment, the strange thing was that my colleague first reported that they could no longer access MultiChain via the RPC API, yet I could run getinfo against the node and it all worked fine.

The situation escalated when the API cluster failed, with requests to MultiChain timing out with the following error:

MultichainError: Error: ESOCKETTIMEDOUT

Yet I could still run a getinfo against the node. As soon as I tried to do something against the chain, like creating an asset, I also got a timeout.

Marty
The other thing to check is getwalletinfo - if utxocount is very high, that would explain why certain API operations are very slow and others are fast.
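Something like this is enough to check it (chain1 again being a placeholder for your chain name):

    multichain-cli chain1 getwalletinfo    # look at the utxocount field in the output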
What would be considered a high utxocount?  I'm assuming our current count of 331 is nowhere near high.

I've updated the threads on the RPC interface as per your previous comment, but assuming that will just delay the issue, what can we do next to understand the root cause?
331 is not high at all, so that's not the issue.

So I still think the most likely issue here is to do with API requests not completing.

If you run MultiChain with -debug=mcapi then all API requests and responses should be logged to debug.log - either you or we can look at that to see if we can understand the cause.
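As a rough sketch, assuming the default data directory and a chain called chain1 (adjust the path if you use -datadir or a Docker volume):

    multichaind chain1 -daemon -debug=mcapi    # restart with API request/response logging enabled
    tail -f ~/.multichain/chain1/debug.log     # watch the logged API calls and responses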
...