Sporadically receiving connection error when doing JSON RPC requests

+2 votes
Hello,

I made a web application that sends JSON RPC requests to a MultiChain node on the same server (localhost). The requests work correctly most of the time, but in about 4% of cases the request fails, and the only error I have available is what the standard .NET WebRequest library tells me: "Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host." This should mean that, for some reason, MultiChain is closing the connection while in the process of sending back a response.
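
For context, each call is just a JSON RPC POST to the node's RPC port, roughly like the sketch below. This is only a minimal sketch of the shape of the request, not my actual code; the port, credentials and request body are placeholders:

    // Minimal sketch of the kind of request involved, not the actual application code.
    // The RPC port, credentials and method are placeholders.
    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    class RpcCallSketch
    {
        static string Call(string method, string paramsJson)
        {
            var request = (HttpWebRequest)WebRequest.Create("http://127.0.0.1:4352/");
            request.Method = "POST";
            request.ContentType = "application/json";
            request.Credentials = new NetworkCredential("multichainrpc", "rpc-password");

            byte[] body = Encoding.UTF8.GetBytes(
                "{\"method\":\"" + method + "\",\"params\":" + paramsJson + ",\"id\":1}");
            request.ContentLength = body.Length;
            using (Stream s = request.GetRequestStream())
                s.Write(body, 0, body.Length);

            // This is where the "connection was forcibly closed" error surfaces,
            // while reading the response back from the node.
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
                return reader.ReadToEnd();
        }
    }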

I've been trying to debug this for a while without success. It can happen with any JSON RPC request and generally happens when there are a bunch of requests in a short time (right now I'm testing with a 1-second delay between requests, and it fails around the tenth consecutive request).

I don't know what's happening or how to debug it, since it seems to be a problem on the MultiChain side, and one that happens only a small percentage of the time.

I tested on MultiChain Community 2.0.3, but this also happened on 2.0.2.

What can I do to further investigate this situation?
Any idea on why it could be happening?
asked Oct 23, 2019 by mgiacomellijsb

1 Answer

0 votes

If the problem is occurring when you are sending a lot of requests at the same time, the likely explanation is that some of these requests are taking a while, and your MultiChain node is not able to answer some new incoming connections because all the RPC threads are used up. Three suggestions that might help:

  • Improve the performance of the server (maybe this is a light cloud instance?)
  • Use the rpcthreads runtime parameter to increase the number of RPC commands that can be processed in parallel.
  • Increase the timeout parameters on the client side (see the sketch after this list).
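
To make the last two points concrete, something along these lines (a sketch only: the chain name, port and numeric values are placeholders; rpcthreads is the runtime parameter mentioned above):

    // Node side: raise the number of RPC worker threads when starting the node,
    // e.g. (chain name and value are placeholders):
    //     multichaind chain1 -rpcthreads=64
    //
    // Client side: raise the .NET timeouts (values in milliseconds for WebRequest).
    using System;
    using System.Net;
    using System.Net.Http;

    class TimeoutSketch
    {
        static void Configure()
        {
            var request = (HttpWebRequest)WebRequest.Create("http://127.0.0.1:4352/");
            request.Timeout = 300000;           // total request timeout, default is 100 s
            request.ReadWriteTimeout = 300000;  // timeout for reading the response stream

            var client = new HttpClient { Timeout = TimeSpan.FromMinutes(5) };  // HttpClient equivalent
        }
    }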
answered Oct 26, 2019 by MultiChain
Thank you for the answer. I can't find any information in the documentation about the rpcthreads parameter. What is the default value? What is the maximum?
If you run multichaind with no parameters you'll see a full list of all runtime parameters it supports.
As I said in the edit above:
I tried setting it to 200, but consecutive requests (not parallel) still fail.
I checked the client timeout, but it's 100 seconds by default, so that's definitely not the cause, since the failure happens almost immediately.
It feels like MultiChain can't keep up with consecutive incoming requests over a short time, even though my virtual machine (Windows) seems able to handle the workload.
OK, please clarify the virtualization setup, i.e. (a) what is virtualized inside what (Windows inside Windows?), and (b) relative to that setup, where the node is running and where the requests are coming from.
It's a Windows Server 2012 virtual machine running on the Azure cloud.
The node is on that machine, and the web application making the requests is on the same machine, so the JSON RPC requests never leave the machine and are all internal.
Thanks - I've forwarded this to the team and will see if they have any ideas.
Faced the same problem when using version 1.0.8. Had to add retry code in .NET as a workaround. Mine was running on an Azure VM with Ubuntu 16.04.
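The retry wrapper was roughly of this shape (a simplified sketch, not the exact code; sendRpc stands in for whatever actually performs the JSON RPC call):

    using System;
    using System.IO;
    using System.Net;
    using System.Threading;

    class RetrySketch
    {
        // Retry the call a few times when the connection is dropped mid-response.
        static string CallWithRetry(Func<string> sendRpc, int maxAttempts = 3)
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    return sendRpc();
                }
                catch (Exception ex) when (attempt < maxAttempts &&
                                           (ex is IOException || ex is WebException))
                {
                    // "Unable to read data from the transport connection..." shows up
                    // as an IOException, often wrapped in a WebException.
                    Thread.Sleep(200 * attempt);  // simple linear back-off before retrying
                }
            }
        }
    }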
We discussed this internally some more. It's not a problem that has been reported before, so it seems more likely to be an internal networking issue on the virtual machine you are using, rather than something specific to MultiChain. Could you check whether any kind of rate limiting is being applied, either at the operating system level or by Azure itself?
For example see this documentation:

https://docs.microsoft.com/en-us/azure/virtual-network/virtual-machine-network-throughput

Maybe localhost traffic is still counted towards these limits?
For clarification: I checked and received a reply from Microsoft (please refer to the feedback section below the link mentioned above) confirming that internal traffic within the same host does not count towards these limits.
OK. And just to confirm you're connecting to the node's API using localhost/127.0.0.1 rather than the server's external IP address?
I can confirm that; every connection goes to localhost/127.0.0.1 and the relevant port.
OK, thanks for your replies. The last thing I would ask you to try is some other method of sending the requests to the MultiChain node, instead of .NET and the webrequest library. For example you could use Apache JMeter to send a simple command to the node in a loop. That should help isolate whether the problem is in the calling library or in the operating system or node.
I tried Advanced REST Client directly on the server to send a bunch of getinfo requests to MultiChain, and I haven't seen any issues even after repeatedly spamming the send button.
I'll try the following things next:
- try to recreate the specific request in ARC and spam it (it's a stream insertion with some values)
- try another HTTP client implementation instead of the .NET WebRequest library (for instance RestSharp instead of System.Net.Http); see the sketch below
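For the second point, I'm thinking of something along these lines (written against the classic pre-v107 RestSharp API, so names may differ in newer versions; the port, credentials and stream/key names are placeholders):

    using RestSharp;
    using RestSharp.Authenticators;

    class RestSharpSketch
    {
        static string PublishItem()
        {
            var client = new RestClient("http://127.0.0.1:4352/");
            client.Authenticator = new HttpBasicAuthenticator("multichainrpc", "rpc-password");

            var request = new RestRequest("/", Method.POST);
            request.AddJsonBody(new
            {
                method = "publish",   // the stream insertion mentioned above
                @params = new object[] { "stream1", "key1", new { json = new { value = 42 } } },
                id = 1
            });

            IRestResponse response = client.Execute(request);
            return response.Content;
        }
    }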
Just to add: what I am observing from the debug log, after adding the -debug=mcapi flag, is that the client encountered the error while MultiChain did not receive any input from the client. @mgiacomellijsb can you verify whether that's the same in your case?

@mgiacomellijsb, I've also changed the .NET WebRequest to HttpClient but am still facing the same issue (roughly the pattern sketched below). Let me know your findings for RestSharp.
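
By "changed to HttpClient" I mean roughly this pattern, with placeholder port and credentials, not the exact code:

    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Text;

    class HttpClientSketch
    {
        // Reuse a single HttpClient instance rather than creating one per request.
        static readonly HttpClient Client = new HttpClient();

        static string Call(string jsonBody)
        {
            var request = new HttpRequestMessage(HttpMethod.Post, "http://127.0.0.1:4352/");
            request.Headers.Authorization = new AuthenticationHeaderValue(
                "Basic",
                Convert.ToBase64String(Encoding.ASCII.GetBytes("multichainrpc:rpc-password")));
            request.Content = new StringContent(jsonBody, Encoding.UTF8, "application/json");

            HttpResponseMessage response = Client.SendAsync(request).Result;
            return response.Content.ReadAsStringAsync().Result;
        }
    }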
Hey all, after a long time I managed to find time to do some more tests, and the theory was right: 200 consecutive requests sent directly to MultiChain through Postman work without a hitch; the same can't be said for the same number of requests made through the .NET WebRequest library.
Now I'll try switching to another library and see if that's what's causing the issues.
@mgiacomellijsb Will be looking forward to your findings. Thanks.
I can't believe it but I managed to solve the issue (apparently) once and for all and I didn't even do anything.
Basically, on the LucidOcean GitHub there was an open pull request that fixed the issue: https://github.com/LucidOcean/multichain/pull/31

It was a weird timeout lease thing. I just did a stress test with over 400 consecutive calls in a very short timespan and they all went through correctly.
MultiChain is not bugged; it's just that the LucidOcean library did the web request poorly, and if you apply that patch the issue is solved.
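
For anyone who can't apply the PR directly: I haven't dug into exactly what the patch changes, but "timeout lease" in .NET usually points at the ServicePoint keep-alive/connection-lease settings, so the fix is probably of this general shape (an illustration of that idea only, not the contents of the PR; the port is a placeholder):

    using System;
    using System.Net;

    class ConnectionLeaseSketch
    {
        static void Configure()
        {
            // Pooled keep-alive connections can be closed by the server while the
            // client still considers them usable, which then surfaces as
            // "connection was forcibly closed by the remote host" on the next call.
            ServicePoint sp = ServicePointManager.FindServicePoint(new Uri("http://127.0.0.1:4352/"));
            sp.ConnectionLeaseTimeout = 60 * 1000;  // recycle connections after 60 s
            sp.MaxIdleTime = 60 * 1000;             // drop idle connections promptly

            // A cruder alternative is request.KeepAlive = false on each HttpWebRequest,
            // at the cost of opening a new connection per call.
        }
    }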
Excellent findings! Thanks.
...