Node rpc completely freezes (rpcssl)

+1 vote
Hi.

Following problem: (version 1.0.1)

1. First node created

2. Second node connects to first node

3. First nodes rpc works fine (also ssl)

4. Second node does not respond to any rpc calls (even locally on the machine using curl or multichain-cli)

5. Second node can only be killed with kill -9

6.  After restart of second node, rpc is available

Last log entries of second node: (log seem nearly the same on first node)

2017-10-09 09:00:58 mchn: Connection from 1PbBy4UD7g3nv19DiodWQ51fZGQtXesaLPnNsn received on peer=11 in verackack (18.194.126.228:44442)
2017-10-09 09:00:58 mchn: Parameter set from peer=11 verified
2017-10-09 09:01:06 MultiChainMiner: Block Found - 007bc5d6629b672be74f91c586ab0be04a9077024818fa68c5b33985d083b18f, prev: 00a2f13f18dfa9c0aa13abe98c2a45eb0ee2dcf0a3f29ec9abaf1d8d35ad0966, height: 114, txs: 2
2017-10-09 09:01:06 UpdateTip:            new best=007bc5d6629b672be74f91c586ab0be04a9077024818fa68c5b33985d083b18f  height=114  log2_work=14.84549  tx=126  date=2017-10-09 09:01:06 progress=1.000000  cache=0
2017-10-09 09:01:07 ResendWalletTransactions()
asked Oct 9, 2017 by Alexoid
edited Oct 9, 2017 by Alexoid
Without ssl, there is no problem on the nodes. What I also get from time to time when ssl is enabled:

The very first rpc call after a longer idle time results in:
Post https://18.XX.XX.228:8000: read tcp 10.XX.XX.86:47470->18.XX.XX.228:8000: read: connection reset by peer


All following calls work fine again. _Only sometimes this freezes the node_

I will stop my investigation here until you have further questions...
Many thanks for doing this investigation. It looks indeed like ssl issue. We will try to reproduce this behavior and find the solution.
Just a last comment that maybe help:

When the rpc port was not reponsive (anymore) from the local machine (using curl) I still could contact it from "outside" using actual application that consumes the rpc api.

For me it seems something like: There are connections open that still work but the node does not have any capacity to open new connections...

For example, after about 10 minutes, "lsof -i" still shows 5 connections to port 8000:

multichai 1457 ubuntu    4u  IPv4 688746      0t0  TCP ip-XX-XX-XX-XX.eu-central-1.compute.internal:afs3-fileserver->ec2-XX-XX-XX.eu-central-1.compute.amazonaws.com:59026 (ESTABLISHED)
multichai 1457 ubuntu   19u  IPv6 688791      0t0  TCP ip-XX-XX-XX-XX.eu-central-1.compute.internal:8000->ec2-XX-XX-XX-XX.eu-central-1.compute.amazonaws.com:29782 (ESTABLISHED)
multichai 1457 ubuntu   24u  IPv6 688742      0t0  TCP *:8000 (LISTEN)
multichai 1457 ubuntu   25u  IPv6 688743      0t0  TCP *:afs3-fileserver (LISTEN)
multichai 1457 ubuntu   26u  IPv4 688744      0t0  TCP *:afs3-fileserver (LISTEN)
multichai 1457 ubuntu   36u  IPv6 693541      0t0  TCP ip-XX-XX-XX-XX.eu-central-1.compute.internal:8000->ec2-XX-XX-XX-XX.eu-central-1.compute.amazonaws.com:31593 (ESTABLISHED)
multichai 1457 ubuntu   37u  IPv6 689010      0t0  TCP ip-XX-XX-XX-XX.eu-central-1.compute.internal:8000->ec2-XX-XX-XX-XX.eu-central-1.compute.amazonaws.com:41480 (ESTABLISHED)
multichai 1457 ubuntu   38u  IPv6 697229      0t0  TCP ip-XX-XX-XX-XX.eu-central-1.compute.internal:8000->ec2-XX-XX-XX-XX.eu-central-1.compute.amazonaws.com:8126 (ESTABLISHED)


But after some while even from the application a connection was not possible anymore (maybe the application keep-alive was triggered and the new connections failed since the old one were still open and not closed properly.


Good luck! :-)
Morning.

Seems to be quite clear now:

Multichain is not closing the open connections by itself.

I am using a go application that is running on a cloud deployment. Normally the client should close the connection after a timeout (keep-alive). But this does not seem to work (maybe firewall etc. issues)

But nevertheless if the client never closes the connection multichain also doesn't close them. And if there are 5 connections open no new connections will be processed by multichain.

Maybe multichain should actively close connections after a timeout? Or even have a flag to disable keep-alive?

Otherwise open idle connection would block the complete rpc api.


<EDIT>
Sorry, just found:  -rpckeepalive will try that...
This seems to "fix" the problem. multichain immediately closes the connections and does not get stuck anymore.

But shouldn't also the server side close idle connections after a period of time?
That does not seem to happen.

1 Answer

+1 vote
Thank you again for doing this investigation.

It looks like the problem is not related to -rpcssl, we can now simulate "stuck RPC" behaviour without -rpcssl too. Probably, without SSL your client closes connection automatically.

Actually, what you see is intended behaviour. MultiChain has 4 threads responding to RPC requests (can be overridden by -rpcthreads). If -rpckeepalive=1 and you run more than 4 clients and none of them closes connection, the 5th connection will fail.

We will set default value of -rpckeepalive to 0 in the next version of MultiChain.

In the meantime you can either start multichaind with -rpckeepalive=0 or play with -rpcthreads or close connections after every request.
answered Oct 17, 2017 by Michael
...