Blockchain stops working with 3 nodes using a stream with 2M transactions

I have the following scenario:

  • 1 node that created the blockchain and a stream, into which I added 2M items with random off-chain content (images).
  • 2 nodes that join the blockchain after all the data has been imported through the master node, so on startup they begin catching up with the master node at their own pace.

At the very beginning, while the 2 newly joined nodes are catching up, I can issue commands to any of the 3 nodes using the multichain-cli tool. For instance, I send this command to the master node and get the reply back immediately:

multichain-cli ipregister -rpcuser=multichainrpc -rpcpassword=m4lt1ch41nRpC! getaddresses
{"method":"getaddresses","params":[],"id":"85616894-1551974478","chain_name":"ipregister"}

[
    "1WTvECnAdMgvXDqMDQeiwXExCfvvMuHwq1HR7S"
]

However, at some point, as the volume of already-synced data grows, multichain-cli stops responding promptly on all the nodes, and as the synced volume keeps growing, the nodes degrade dramatically until they stop responding to commands altogether.
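To pin down when a node stops answering, rather than waiting on multichain-cli indefinitely, each call can be wrapped in a time limit. This is a minimal sketch, assuming the timeout utility from GNU coreutils is available (it exits with status 124 when the limit is reached); probe is a hypothetical helper name, not part of MultiChain:

```shell
# Hypothetical helper (not part of MultiChain): run a command with a time
# limit and report whether it answered in time.
# Assumes GNU coreutils `timeout`, which exits with status 124 on expiry.
probe() {
    limit="$1"
    shift
    # Discard the command's own output; only the outcome matters here.
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -eq 124 ]; then
        echo "timeout"
    else
        echo "error"
    fi
}
```

For example, probe 30 multichain-cli ipregister getinfo (plus the -rpcuser/-rpcpassword options used earlier, if your setup needs them) prints timeout if the node takes longer than 30 seconds to answer. Running it periodically against all 3 nodes would show when each one starts to degrade.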

Both of the 2 nodes join the blockchain through the master's DNS name, and all they do is subscribe to the stream so they can look up data by key.

Then I stop everything (without deleting the data) and restart the master node first. It responds to multichain-cli commands again. But when I restart the first of the 2 additional nodes, I get this output from it:

Node started
mchn: Sending minimal parameter set to 174.16.60.237:8099
receive version message: /MultiChain:0.2.0.6/: version 70002, blocks=28632, us=174.16.138.178:50952, peer=1
Added time data, samples 2, offset +0 (+0 minutes)
mchn: Parameter set from peer=1 verified
Loading addresses from DNS seeds (could take a while)
0 addresses found from DNS seeds
dnsseed thread exit
socket sending timeout: 61s

And this on the master:

Sending minimal parameter set to 174.16.138.178:50952
receive version message: /MultiChain:0.2.0.6/: version 70002, blocks=28627, us=174.16.60.237:8099, peer=7
Added time data, samples 3, offset +0 (+0 minutes)
mchn: Parameter set from peer=7 verified
ResendWalletTransactions()
Sending minimal parameter set to 174.16.138.178:8099
receive version message: /MultiChain:0.2.0.6/: version 70002, blocks=28627, us=174.16.60.237:58276, peer=12
mchn: Parameter set from peer=12 verified

Clearly, the 2nd node degrades right after rejoining the blockchain. If I also start the 3rd node, it gets even worse.

asked Mar 7 by emedina
Can you please share some information about the environment in which these MultiChain nodes are running, e.g. whether it's a virtual environment, the CPU, and the memory?
This is deployed in Kubernetes, using physical machines (80 cores, 512GB) and network block storage (Ceph).

1 Answer

Best answer
Thanks for more details in the comment. We've had previous reports of memory issues with Kubernetes – would you be able to re-run your experiment directly on host operating systems, so we can determine if that's the cause of the problem?
answered Mar 8 by MultiChain
selected Mar 19 by emedina
Thanks for your follow-up - we think we may have found the underlying issue for the scenario you're experiencing - more soon.
That's great news. Looking forward to your feedback!

I also installed it on AWS using m5d.xlarge "dedicated" instances with SSD disks and faced exactly the same issues: subscribing takes minutes (without "-txindex"), with the following logs:

Still rescanning. At block 3936. Progress=1.000000
ping timeout: 60.033212s
Still rescanning. At block 7544. Progress=1.000000
Still rescanning. At block 11047. Progress=1.000000
Still rescanning. At block 14706. Progress=1.000000
Still rescanning. At block 18658. Progress=1.000000
Still rescanning. At block 22610. Progress=1.000000
mchn: Sending minimal parameter set to 172.31.18.75:48946
receive version message: /MultiChain:0.2.0.8/: version 70002, blocks=25749, us=172.31.29.148:8099, peer=9
mchn: Parameter set from peer=9 verified
mchn: Synced with node 9 on block 25749 - requesting mempool

Right after subscribing, I send a query by key and it replies immediately. Then I leave it running for a while, come back, issue a new query, and it takes several minutes to reply (the same query on the master replies immediately).
More feedback...

I left the AWS instances running for some hours, and now when I query by key on the client node it again takes more than 15 minutes, while the master replies immediately.
Eventually everything has been solved. Amazing support.
Just a comment for any others following this discussion – the performance issue was caused by the stream having ~100 different offchain data payloads, each of which appears 1000+ times in the stream. These heavily duplicated (and therefore identically hashed) stream items are an unusual usage pattern but we've now resolved the performance issue in this case. The fix will be in the 2.0 production release.
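Since the root cause here was heavily duplicated off-chain payloads, it may be worth checking a dataset for this pattern before publishing it to a stream. This is a minimal sketch, assuming the payloads (here, the images) are files in a local directory and that sha256sum from GNU coreutils is available (on macOS, shasum -a 256 is the usual substitute); count_duplicate_payloads is a hypothetical helper. Byte-identical files produce identical digests whatever hash MultiChain uses internally, so the counts reflect the kind of duplication described in the comment above.

```shell
# Hypothetical helper: count how many times each distinct payload occurs
# among the regular files in a directory (default: current directory).
# Byte-identical files yield identical SHA-256 digests.
count_duplicate_payloads() {
    dir="${1:-.}"
    # Hash every file, keep only the digest column (GNU sha256sum prefixes
    # the line with a backslash for unusual file names, so strip it),
    # then count identical digests, most repeated first.
    find "$dir" -type f -exec sha256sum {} + \
        | awk '{ sub(/^\\/, "", $1); print $1 }' \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Running count_duplicate_payloads ./images | head lists the most repeated payloads first; the first column is the number of copies. A small number of digests with very large counts corresponds to the unusual usage pattern that triggered the performance issue.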
...