Node RPC completely freezes (rpcssl)

Hi.

The following problem occurs (version 1.0.1):

1. First node created

2. Second node connects to first node

3. First node's RPC works fine (also with SSL)

4. Second node does not respond to any RPC calls, even locally on the machine using curl or multichain-cli (see the probe sketch below)

5. Second node can only be killed with kill -9

6. After a restart of the second node, RPC is available again
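
For reference, a minimal sketch of the kind of probe that shows the freeze from the client side: a JSON-RPC call with a hard timeout, so a stuck RPC port fails fast instead of hanging like curl does. The credentials are placeholders rather than the actual setup; port 8000 is the rpcport used here.

package main

import (
    "fmt"
    "net/http"
    "strings"
    "time"
)

func main() {
    // Fail fast if the RPC port accepts the TCP connection but never answers.
    client := &http.Client{Timeout: 5 * time.Second}

    req, err := http.NewRequest("POST", "http://127.0.0.1:8000",
        strings.NewReader(`{"method":"getinfo","params":[]}`))
    if err != nil {
        panic(err)
    }
    req.SetBasicAuth("multichainrpc", "<rpcpassword>") // placeholder credentials

    resp, err := client.Do(req)
    if err != nil {
        fmt.Println("RPC frozen or unreachable:", err) // a timeout here means the RPC is stuck
        return
    }
    defer resp.Body.Close()
    fmt.Println("RPC responded:", resp.Status)
}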

Last log entries of the second node (the log looks nearly the same on the first node):

2017-10-09 09:00:58 mchn: Connection from 1PbBy4UD7g3nv19DiodWQ51fZGQtXesaLPnNsn received on peer=11 in verackack (18.194.126.228:44442)
2017-10-09 09:00:58 mchn: Parameter set from peer=11 verified
2017-10-09 09:01:06 MultiChainMiner: Block Found - 007bc5d6629b672be74f91c586ab0be04a9077024818fa68c5b33985d083b18f, prev: 00a2f13f18dfa9c0aa13abe98c2a45eb0ee2dcf0a3f29ec9abaf1d8d35ad0966, height: 114, txs: 2
2017-10-09 09:01:06 UpdateTip:            new best=007bc5d6629b672be74f91c586ab0be04a9077024818fa68c5b33985d083b18f  height=114  log2_work=14.84549  tx=126  date=2017-10-09 09:01:06 progress=1.000000  cache=0
2017-10-09 09:01:07 ResendWalletTransactions()
asked Oct 9, 2017 by Alexoid
edited Oct 9, 2017 by Alexoid
Another finding:
It is only the RPC API that is affected. The node is still receiving transactions from the network.

And now the first node was also affected by this issue (RPC froze).
Maybe this helps.
I have now activated the following debug flags: -debug=mchn -debug=mcapi

The RPC froze again; this is the last output of debug.log:

2017-10-09 14:02:06 mcapi: API request successful: listpermissions
2017-10-09 14:02:06 mcapi: API request from ::ffff:52.58.183.15
2017-10-09 14:02:06 mcapi: API request: {"method":"liststreamitems","params":["demoBucket"]}

2017-10-09 14:02:06 mcapi: API request successful: liststreamitems
2017-10-09 14:02:06 mcapi: API request from ::ffff:52.58.183.15
2017-10-09 14:02:06 mcapi: API request: {"method":"liststreamkeyitems","params":["demoBucket","0815"]}

2017-10-09 14:02:06 mcapi: API request successful: liststreamkeyitems

So nothing suspicious here. :-(
Thank you for sharing this information. We were just going to ask you to run MultiChain with the following flags:

-debug=mchn -debug=mcblock -debug=mcapi -debug=mcminer

It looks like you are able to reproduce this problem (we can't). Can you please:

1. Try with the above flags and send us the end of debug.log if there is anything suspicious there
2. Revoke mine permission from the node
3. Try it without -rpcssl

Thank you
I cannot really reproduce it on demand. After a while the node is just not responding anymore.
I will add the other debug parameters.

My guess is that it has something to do with rpcssl, since I activated it just a few days ago, and before that I never had any problems.
Thank you for your response.

I'm trying to eliminate all other possible causes of the problem, so if you can run with these debug switches, that would be great.

I would also like to make sure that it is not related to the miner. So, if you don't care about mining on this node and you can start multichaind with the -gen=0 runtime parameter, that would be very helpful.
Hi.

I now used -debug=mchn -debug=mcapi -debug=mcminer -debug=mcblock

After some requests, the RPC again stops responding.

But unfortunately there are no further helpful entries in debug.log.

Last entries are:
2017-10-10 11:43:43 mcapi: API request: {"method":"liststreamitems","params":["demoBucket"]}

2017-10-10 11:43:43 mcapi: API request successful: liststreamitems
2017-10-10 11:43:43 mcapi: API request from ::ffff:XX.58.XXX.15
2017-10-10 11:43:43 mcapi: API request: {"method":"liststreamkeyitems","params":["demoBucket","0815"]}

2017-10-10 11:43:43 mcapi: API request successful: liststreamkeyitems


A few seconds later, the second of the two nodes also stopped responding on RPC, with the same log output.

Now I will test it with -gen=0 on one node.

#> lsof -i

shows that port 8000 is still open on the multichaind process:

multichai 4924 ubuntu   24u  IPv6 610351      0t0  TCP *:8000 (LISTEN)
No difference with -gen=0.

Both nodes are unresponsive again. What I will try next: disable rpcssl and check whether the problem is gone.
Without SSL, there is no problem on the nodes. Something else I get from time to time when SSL is enabled:

The very first RPC call after a longer idle period results in:
Post https://18.XX.XX.228:8000: read tcp 10.XX.XX.86:47470->18.XX.XX.228:8000: read: connection reset by peer


All subsequent calls then work fine again. Only sometimes does this freeze the node.
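
A simple client-side workaround for the occasional reset is to retry once when the first call after idling fails this way. A sketch, assuming a Go client; doRPC is a hypothetical stand-in for the function that performs the actual JSON-RPC POST:

package main

import (
    "errors"
    "fmt"
    "strings"
)

// callWithRetry retries exactly once when a request fails with
// "connection reset by peer", i.e. when the first call after a long idle
// period hits a stale keep-alive connection the server already dropped.
func callWithRetry(doRPC func() error) error {
    err := doRPC()
    if err != nil && strings.Contains(err.Error(), "connection reset by peer") {
        return doRPC() // the retry opens a fresh connection
    }
    return err
}

func main() {
    attempt := 0
    err := callWithRetry(func() error {
        attempt++
        if attempt == 1 {
            // Simulate the error seen above on the first call after idling.
            return errors.New("read: connection reset by peer")
        }
        return nil
    })
    fmt.Println("attempts:", attempt, "err:", err)
}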

I will stop my investigation here until you have further questions...
Many thanks for doing this investigation. It does indeed look like an SSL issue. We will try to reproduce this behavior and find a solution.
Just a last comment that may help:

When the RPC port was no longer responsive from the local machine (using curl), I could still contact it from "outside" using the actual application that consumes the RPC API.

To me it looks like this: there are open connections that still work, but the node does not have any capacity left to accept new connections...

For example, after about 10 minutes, "lsof -i" still shows 5 connections to port 8000:

multichai 1457 ubuntu    4u  IPv4 688746      0t0  TCP ip-XX-XX-XX-XX.eu-central-1.compute.internal:afs3-fileserver->ec2-XX-XX-XX.eu-central-1.compute.amazonaws.com:59026 (ESTABLISHED)
multichai 1457 ubuntu   19u  IPv6 688791      0t0  TCP ip-XX-XX-XX-XX.eu-central-1.compute.internal:8000->ec2-XX-XX-XX-XX.eu-central-1.compute.amazonaws.com:29782 (ESTABLISHED)
multichai 1457 ubuntu   24u  IPv6 688742      0t0  TCP *:8000 (LISTEN)
multichai 1457 ubuntu   25u  IPv6 688743      0t0  TCP *:afs3-fileserver (LISTEN)
multichai 1457 ubuntu   26u  IPv4 688744      0t0  TCP *:afs3-fileserver (LISTEN)
multichai 1457 ubuntu   36u  IPv6 693541      0t0  TCP ip-XX-XX-XX-XX.eu-central-1.compute.internal:8000->ec2-XX-XX-XX-XX.eu-central-1.compute.amazonaws.com:31593 (ESTABLISHED)
multichai 1457 ubuntu   37u  IPv6 689010      0t0  TCP ip-XX-XX-XX-XX.eu-central-1.compute.internal:8000->ec2-XX-XX-XX-XX.eu-central-1.compute.amazonaws.com:41480 (ESTABLISHED)
multichai 1457 ubuntu   38u  IPv6 697229      0t0  TCP ip-XX-XX-XX-XX.eu-central-1.compute.internal:8000->ec2-XX-XX-XX-XX.eu-central-1.compute.amazonaws.com:8126 (ESTABLISHED)


But after a while, even from the application a connection was no longer possible (maybe the application's keep-alive was triggered and the new connections failed, since the old ones were still open and not closed properly).


Good luck! :-)
Morning.

Seems to be quite clear now:

MultiChain is not closing open connections by itself.

I am using a Go application that runs in a cloud deployment. Normally the client should close the connection after a timeout (keep-alive), but this does not seem to work (maybe firewall issues etc.).

Nevertheless, if the client never closes its connections, MultiChain doesn't close them either. And once there are 5 connections open, no new connections will be processed by MultiChain.

Maybe MultiChain should actively close connections after a timeout? Or even have a flag to disable keep-alive?

Otherwise open idle connections can block the complete RPC API.
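
One possible client-side mitigation, sketched below under the assumption of a Go client like the one described above (values are illustrative): let the client itself close idle keep-alive connections, so the node's RPC capacity is freed even if the server never closes them. Transport.IdleConnTimeout requires Go 1.7 or later.

package main

import (
    "net/http"
    "time"
)

// newRPCClient returns an http.Client that closes its own idle keep-alive
// connections instead of holding them open indefinitely.
func newRPCClient() *http.Client {
    return &http.Client{
        Timeout: 10 * time.Second, // per-request timeout
        Transport: &http.Transport{
            MaxIdleConns:    2,                // keep at most 2 idle connections around
            IdleConnTimeout: 30 * time.Second, // close idle connections after 30s (Go 1.7+)
        },
    }
}

func main() {
    _ = newRPCClient()
}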


<EDIT>
Sorry, I just found -rpckeepalive; I will try that...
This seems to "fix" the problem. MultiChain immediately closes the connections and does not get stuck anymore.

But shouldn't the server side also close idle connections after a period of time?
That does not seem to happen.

1 Answer

Thank you again for doing this investigation.

It looks like the problem is not related to -rpcssl; we can now simulate the "stuck RPC" behaviour without -rpcssl too. Probably, without SSL, your client closes the connection automatically.

Actually, what you see is intended behaviour. MultiChain has 4 threads responding to RPC requests (this can be overridden with -rpcthreads). If -rpckeepalive=1 and you run more than 4 clients and none of them closes its connection, the 5th connection will fail.

We will set the default value of -rpckeepalive to 0 in the next version of MultiChain.

In the meantime, you can either start multichaind with -rpckeepalive=0, or play with -rpcthreads, or close connections after every request (a sketch of the last option follows below).
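
A minimal sketch of the last option, assuming a Go client (host and credentials are placeholders): with keep-alive disabled, the client sends Connection: close on every request, so each request frees its RPC worker thread as soon as the response has been read.

package main

import (
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // DisableKeepAlives makes the client send "Connection: close" on every
    // request, so no idle connection ever occupies one of the node's RPC
    // worker threads (4 by default, see -rpcthreads).
    client := &http.Client{
        Transport: &http.Transport{DisableKeepAlives: true},
    }

    req, err := http.NewRequest("POST", "http://127.0.0.1:8000",
        strings.NewReader(`{"method":"getinfo","params":[]}`))
    if err != nil {
        panic(err)
    }
    req.SetBasicAuth("multichainrpc", "<rpcpassword>") // placeholder credentials

    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status)
}
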
answered Oct 17, 2017 by Michael
...