Blockchain stops working with 3 nodes using a stream with 2M transactions

+1 vote

I have the following scenario:

  • 1 node that created the blockchain and a stream, into which I added 2M items with random off-chain content (images).
  • 2 nodes that join the blockchain after all the data has been imported through the master node, so on startup they begin catching up with the master node at their own pace.

At the very beginning, while the 2 newly joined nodes are catching up, I can issue commands to any of the 3 nodes using the multichain-cli tool. For instance, I send this command to the master node and get the reply back immediately:

multichain-cli ipregister -rpcuser=multichainrpc -rpcpassword=m4lt1ch41nRpC! getaddresses
{"method":"getaddresses","params":[],"id":"85616894-1551974478","chain_name":"ipregister"}

[
    "1WTvECnAdMgvXDqMDQeiwXExCfvvMuHwq1HR7S"
]
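
For reference, multichain-cli is just a thin wrapper over the node's JSON-RPC API, so the same call can be made directly over HTTP. A minimal sketch in Python (the URL uses port 8570, which appears as the RPC port later in the debug log; credentials as above — adjust both for your setup):

```python
import base64
import json
import urllib.request

# Illustrative values: the RPC port comes from the chain's params.dat
# (default-rpcport) and the credentials from multichain.conf.
RPC_URL = "http://127.0.0.1:8570"
RPC_USER, RPC_PASS = "multichainrpc", "m4lt1ch41nRpC!"

def build_request(method, params=None, chain_name="ipregister"):
    """Build the JSON-RPC payload that multichain-cli echoes before sending."""
    return json.dumps({"method": method, "params": params or [],
                       "id": "1", "chain_name": chain_name})

def call(method, params=None):
    """POST the request with HTTP basic auth (needs a live node)."""
    req = urllib.request.Request(
        RPC_URL,
        data=build_request(method, params).encode(),
        headers={"Content-Type": "application/json"})
    token = base64.b64encode(f"{RPC_USER}:{RPC_PASS}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]
```

`call("getaddresses")` against a live node would return the same address list that multichain-cli prints.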

However, at some point, as the volume of synced data grows, the multichain-cli tool stops responding on all the nodes, and as the volume of synced data grows further, the nodes degrade dramatically until they no longer respond to commands at all.

Both additional nodes join the blockchain through the DNS name of the master, and they just subscribe to the stream to be able to find data by key, that's all.
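
For context, "subscribe and find by key" amounts to just two API calls on each joining node; the payloads look roughly like this (the key value is an example taken from logs later in this thread):

```python
import json

# Subscribing is a one-time call per node; afterwards the node indexes the
# stream locally so that lookups by key become possible.
subscribe = {"method": "subscribe", "params": ["ipregister"], "id": "1"}

# A lookup by key, as issued when searching the stream.
lookup = {"method": "liststreamkeys",
          "params": ["ipregister", "000379230"], "id": "2"}

for payload in (subscribe, lookup):
    print(json.dumps(payload))
```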

Then I stop everything (without deleting the data) and restart the master node first. It starts responding to multichain-cli commands again. But when I restart the first of the 2 additional nodes, I start getting this from it:

Node started
mchn: Sending minimal parameter set to 174.16.60.237:8099
receive version message: /MultiChain:0.2.0.6/: version 70002, blocks=28632, us=174.16.138.178:50952, peer=1
Added time data, samples 2, offset +0 (+0 minutes)
mchn: Parameter set from peer=1 verified
Loading addresses from DNS seeds (could take a while)
0 addresses found from DNS seeds
dnsseed thread exit
socket sending timeout: 61s

And then in the master:

Sending minimal parameter set to 174.16.138.178:50952
receive version message: /MultiChain:0.2.0.6/: version 70002, blocks=28627, us=174.16.60.237:8099, peer=7
Added time data, samples 3, offset +0 (+0 minutes)
mchn: Parameter set from peer=7 verified
ResendWalletTransactions()
Sending minimal parameter set to 174.16.138.178:8099
receive version message: /MultiChain:0.2.0.6/: version 70002, blocks=28627, us=174.16.60.237:58276, peer=12
mchn: Parameter set from peer=12 verified

Clearly, the 2nd node degrades just after rejoining the blockchain. If I join the 3rd node, it gets even worse.

asked Mar 7, 2019 by emedina
Can you please share some information about the environment in which these MultiChain nodes are running, e.g. if it's a virtual environment or not, CPU, memory?
This is deployed in Kubernetes, using physical machines (80 cores, 512GB) and network block storage (Ceph).

1 Answer

+1 vote
 
Best answer
Thanks for more details in the comment. We've had previous reports of memory issues with Kubernetes – would you be able to re-run your experiment directly on host operating systems, so we can determine if that's the cause of the problem?
answered Mar 8, 2019 by MultiChain
selected Mar 19, 2019 by emedina
Were such memory issues related to MultiChain running inside a container, or just to not enough memory being assigned to the process?
Also, while one node is trying to catch up with the master, the logs constantly show something like this:

test-multichain/multichain-node-0[multichain]: ResendWalletTransactions()
test-multichain/multichain-node-0[multichain]: Timeout downloading block 005d94ebfb4f7bb363a9a70e7af5068362ffb74b6a17b88a585fa824ebba2ebf from peer=164, disconnecting
test-multichain/multichain-node-0[multichain]: mchn: Sending minimal parameter set to 174.16.60.237:8099
test-multichain/multichain-node-0[multichain]: receive version message: /MultiChain:0.2.0.6/: version 70002, blocks=24751, us=174.16.35.7:37008, peer=165
test-multichain/multichain-node-0[multichain]: mchn: Parameter set from peer=165 verified
test-multichain/multichain-node-0[multichain]: ResendWalletTransactions()

Is this normal? I mean, the fact that the peer number is incremented on each timeout.
More evidence (after having set 8GB RAM for all nodes):

test-multichain/multichain-node-0[multichain]: Still rescanning. At block 15704. Progress=1.000000
test-multichain/multichain-node-1[multichain]: socket receive timeout: 61s
test-multichain/multichain-node-1[multichain]: Still rescanning. At block 310. Progress=1.000000
test-multichain/multichain-node-0[multichain]: Still rescanning. At block 15977. Progress=1.000000
test-multichain/multichain-node-1[multichain]: Still rescanning. At block 558. Progress=1.000000
test-multichain/multichain-node-0[multichain]: Still rescanning. At block 16208. Progress=1.000000
test-multichain/multichain-node-1[multichain]: Still rescanning. At block 799. Progress=1.000000
test-multichain/multichain-node-0[multichain]: Still rescanning. At block 16459. Progress=1.000000
test-multichain/multichain-node-1[multichain]: Still rescanning. At block 1034. Progress=1.000000
test-multichain/multichain-node-0[multichain]: Still rescanning. At block 16685. Progress=1.000000
test-multichain/multichain-node-1[multichain]: Still rescanning. At block 1290. Progress=1.000000
test-multichain/multichain-node-0[multichain]: Still rescanning. At block 16929. Progress=1.000000
test-multichain/multichain-node-1[multichain]: Still rescanning. At block 1563. Progress=1.000000
test-multichain/multichain-node-0[multichain]: Still rescanning. At block 17146. Progress=1.000000
test-multichain/multichain-node-1[multichain]: Still rescanning. At block 1784. Progress=1.000000

Notice both node 0 and node 1 doing some sort of rescan and throwing timeouts.

This rescan seems to be very slow. Is it similar to using -txindex on node startup?

The master node always responds and seems to be working fine the whole time, no matter how unresponsive the other nodes become.
Regarding Kubernetes, we had reports of a memory issue inside Kubernetes itself that interferes with MultiChain's functioning. So I think the first step in assessing this problem is to confirm whether it can be reproduced outside of a Kubernetes environment.
Same behaviour (though it maybe takes a bit longer to degrade) when running outside of Kubernetes on physical servers.

If you provide a place where I can upload 5GB, I can share the data folder for you to reproduce it.
More feedback. I set up another clean physical environment and started a single node using the existing data folder from my previous import. I get this error if I use -txindex or -reindex:

# tail -f /data/ipregister/debug.log

************************
EXCEPTION: St13runtime_error
ReserveKeyFromKeyPool() : read failed
bitcoin in AppInit()

2019-03-12 09:34:19 Shutdown: In progress...
2019-03-12 09:34:19 StopNode()
2019-03-12 09:34:19 Node stopped
2019-03-12 09:34:19 Shutdown: done

However, if I don't use those flags, it starts up properly, but then the CLI does not respond...
More findings. When I start up the node without those flags, these are the logs:

# tail -f /data/ipregister/debug.log
2019-03-12 09:45:13 Using BerkeleyDB version Berkeley DB 4.8.30: (April  9, 2010)
2019-03-12 09:45:13 Default data directory /data/ipregister
2019-03-12 09:45:13 Using data directory /data/ipregister
2019-03-12 09:45:13 Using config file /data/ipregister/ipregister.conf
2019-03-12 09:45:13 Using at most 125 connections (1000000 file descriptors available)
2019-03-12 09:45:13 Using 16 threads for script verification
2019-03-12 09:45:13 Using wallet wallet.dat
2019-03-12 09:45:13 init message: Verifying wallet...
2019-03-12 09:45:13 Wallet file exists. WalletDBVersion: 2.
2019-03-12 09:45:13 CDBEnv::Open : LogDir=/data/ipregister/database ErrorFile=/data/ipregister/db.log
2019-03-12 09:45:16 init message: Initializing multichain...
2019-03-12 09:45:16 mchn: Parameter set is valid - initializing blockchain parameters...
2019-03-12 09:45:28 mchn: Burn address: 1XXXXXXWaXXXXXXXvJXXXXXXTBXXXXXXYYrjCv
2019-03-12 09:45:51 Wallet mode: 00000003
2019-03-12 09:45:51 nFileVersion = 100000
2019-03-12 09:45:51 Keys: 3 plaintext, 0 encrypted, 3 w/ metadata, 3 total
2019-03-12 09:45:51 mchn: Default connect  address: 1KC2pYE35C75zLGY5WefjinGzmtNVvjDyhCvp
2019-03-12 09:45:51 mchn: Default send     address: 1KC2pYE35C75zLGY5WefjinGzmtNVvjDyhCvp
2019-03-12 09:45:51 mchn: Default receive  address: 1KC2pYE35C75zLGY5WefjinGzmtNVvjDyhCvp
2019-03-12 09:45:51 mchn: Default create   address: 1KC2pYE35C75zLGY5WefjinGzmtNVvjDyhCvp
2019-03-12 09:45:51 mchn: Default issue    address: 1KC2pYE35C75zLGY5WefjinGzmtNVvjDyhCvp
2019-03-12 09:45:51 mchn: Default mine     address: 1KC2pYE35C75zLGY5WefjinGzmtNVvjDyhCvp
2019-03-12 09:45:51 mchn: Default admin    address: 1KC2pYE35C75zLGY5WefjinGzmtNVvjDyhCvp
2019-03-12 09:45:51 mchn: Default activate address: 1KC2pYE35C75zLGY5WefjinGzmtNVvjDyhCvp
2019-03-12 09:45:51 Node paused state is set to 00000000
2019-03-12 09:45:51 Binding RPC on address :: port 8570 (IPv4+IPv6 bind any: 1)
2019-03-12 09:45:51 Bound to [::]:8099
2019-03-12 09:45:51 Bound to 0.0.0.0:8099
2019-03-12 09:45:51 Filter initialization completed
2019-03-12 09:45:51 Setting banned transaction list:    0 transactions
2019-03-12 09:45:51 init message: Loading block index...
2019-03-12 09:45:51 Opening LevelDB in /data/ipregister/blocks/index
2019-03-12 09:45:51 Opened LevelDB successfully
2019-03-12 09:45:51 Opening LevelDB in /data/ipregister/chainstate
2019-03-12 09:45:52 Opened LevelDB successfully
2019-03-12 09:45:52 nFileVersion = 100000
2019-03-12 09:45:52 Keys: 3 plaintext, 0 encrypted, 3 w/ metadata, 3 total
2019-03-12 09:45:52 LoadBlockIndexDB: last block file = 6
2019-03-12 09:45:52 LoadBlockIndexDB: last block file info: CBlockFileInfo(blocks=3460, size=106831248, heights=21294...24753, time=2019-03-10...2019-03-10)
2019-03-12 09:45:52 Checking all blk files are present...
2019-03-12 09:45:52 LoadBlockIndexDB(): transaction index enabled
2019-03-12 09:45:52 mchn: Rolling back permission DB to height 24753
2019-03-12 09:45:52 mchn: Rolling back asset DB to height 24753
2019-03-12 09:45:52 mchn: Rolling back wallet txs DB to height 24753
2019-03-12 09:45:52 LoadBlockIndexDB(): hashBestChain=00d5a39421b6aebc877880260162b4c479c9b4d06fed0b9fc115f30509af312a height=24753 date=2019-03-10 13:51:25 progress=1.000000
2019-03-12 09:45:52 init message: Verifying blocks...
2019-03-12 09:45:52 Verifying last 288 blocks at level 3
2019-03-12 09:45:52 No coin database inconsistencies in last 289 blocks (19081 transactions)
2019-03-12 09:45:52  block index             916ms
2019-03-12 09:45:52 MultiChain protocol version: 20006
2019-03-12 09:45:52 init message: Loading wallet...
2019-03-12 09:45:52 nFileVersion = 100000
2019-03-12 09:45:52 Keys: 3 plaintext, 0 encrypted, 3 w/ metadata, 3 total
2019-03-12 09:45:52  wallet                    0ms
2019-03-12 09:45:52 mchn: Found 0 assets in 0 groups
2019-03-12 09:45:52 mchn: Asset Grouping. Assets: 0. Group Size: 1. Group Count: 0. Single-Asset Groups: 0.
2019-03-12 09:45:52 mchn: Unspent list initialized: Total: 0, Unspent: 0
2019-03-12 09:45:52 mapBlockIndex.size() = 24754
2019-03-12 09:45:52 nBestHeight = 24753
2019-03-12 09:45:52 setKeyPool.size() = 2
2019-03-12 09:45:52 oldWallet.size() = 0
2019-03-12 09:45:52 mapWallet.size() = 4459059
2019-03-12 09:45:52 mapAddressBook.size() = 1
2019-03-12 09:45:52 init message: Loading addresses...
2019-03-12 09:45:52 Loaded 5 addresses from peers.dat  0ms
2019-03-12 09:45:52 AddLocal(10.138.132.7:8099,1)
2019-03-12 09:45:52 Discover: IPv4 enp3s0f0: 10.138.132.7
2019-03-12 09:45:52 AddLocal(172.17.0.1:8099,1)
2019-03-12 09:45:52 Discover: IPv4 docker0: 172.17.0.1
2019-03-12 09:45:52 AddLocal(174.16.71.64:8099,1)
2019-03-12 09:45:52 Discover: IPv4 tunl0: 174.16.71.64
2019-03-12 09:45:52 dnsseed thread start
2019-03-12 09:45:52 opencon thread start
2019-03-12 09:45:52 dumpaddr thread start
2019-03-12 09:45:52 msghand thread start
2019-03-12 09:45:52 MultiChainMiner started
2019-03-12 09:45:52 addcon thread start
2019-03-12 09:45:52 ReacceptWalletTransactions: 0 txs in unconfirmed pool
2019-03-12 09:45:52 init message: Done loading
2019-03-12 09:45:52 net thread start
2019-03-12 09:45:52 Node started
2019-03-12 09:46:03 Loading addresses from DNS seeds (could take a while)
2019-03-12 09:46:03 0 addresses found from DNS seeds
2019-03-12 09:46:03 dnsseed thread exit
2019-03-12 09:46:20 mcapi: API request from ::ffff:127.0.0.1
2019-03-12 09:46:20 mcapi: API request: {"method":"liststreamkeys","params":["ipregister","000379230"],"id":"17301362-1552383980"}
2019-03-12 09:46:20 mcapi: API request successful: {"method":"liststreamkeys","id":"17301362-1552383980"}
2019-03-12 09:46:23 mcapi: API request from ::ffff:127.0.0.1
2019-03-12 09:46:23 mcapi: API request: {"method":"liststreamkeys","params":["ipregister"],"id":"14800657-1552383983"}

However, when I issue this command:

# ~/multichain/multichain-2.0-beta-3/multichain-cli ipregister -datadir=/data liststreamkeys ipregister
{"method":"liststreamkeys","params":["ipregister"],"id":"14800657-1552383983","chain_name":"ipregister"}

It does not respond, and if I restart the node, it then fails to start, with the error above:

************************
EXCEPTION: St13runtime_error
ReserveKeyFromKeyPool() : read failed
bitcoin in AppInit()

2019-03-12 09:48:40 Shutdown: In progress...
2019-03-12 09:48:40 StopNode()
2019-03-12 09:48:40 Node stopped
2019-03-12 09:48:40 Shutdown: done
Thanks, will forward to the dev team and they'll reply here.
So we'd like to provide a place for you to upload the full blockchain directory. Can you please contact us by email via this form so we can set this up? https://www.multichain.com/contact-us/
More feedback. I started a new blockchain from scratch.

Imported almost 2M transactions into a single stream with random off-chain attachments --> worked.
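
The import itself boiled down to repeated publish calls with the off-chain option; a sketch of the payloads involved (key and data values here are made up — in MultiChain 2.0 the "offchain" options argument stores the payload off-chain and records only its hash on-chain):

```python
import os

def publish_payload(stream, key, data_hex):
    # "offchain" as the options argument keeps the data itself off-chain,
    # with only its hash committed to the chain.
    return {"method": "publish",
            "params": [stream, key, data_hex, "offchain"],
            "id": key}

# The 2M items were generated roughly along these lines (random content):
item = publish_payload("ipregister", "000379230", os.urandom(32).hex())
```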

Joined a new node with the -txindex flag on startup. It started syncing. But in the middle of the sync, I invoked its CLI to subscribe to the stream so as to be able to search. Suddenly, the sync stopped and I got this in the logs:

UpdateTip:            new best=00abf711ac0f475b2164892c5e6b328aef4d05231791d8a5564dca020aa38f1a  height=4990  log2_work=20.285113  tx=495197  date=2019-03-12 14:35:07 progress=1.000000  cache=0
UpdateTip:            new best=0026f20798a9585082978165853e62de2d80c3ac7c75dc6b4d69364145f010b3  height=4991  log2_work=20.285402  tx=495213  date=2019-03-12 14:35:08 progress=1.000000  cache=0
UpdateTip:            new best=00710274991e85a9d841f9365fac9d929c65fdb2564e46f2a7c26e27aee4e7ee  height=4992  log2_work=20.285691  tx=495262  date=2019-03-12 14:35:09 progress=1.000000  cache=0
UpdateTip:            new best=0057be41c0971510eec77a649b2a9a014942bdbb4cd8901f662633f265793711  height=4993  log2_work=20.28598  tx=495276  date=2019-03-12 14:35:11 progress=1.000000  cache=0
UpdateTip:            new best=00486b682e83c23d4b974f82cff644fd893a0cf0aa33af25f69db897c2122793  height=4994  log2_work=20.286269  tx=495417  date=2019-03-12 14:35:18 progress=1.000000  cache=0
UpdateTip:            new best=002ef83a7e71c5b16e4ac45e0801e088e6921dae79aa4222604e0e14e2c195d6  height=4995  log2_work=20.286558  tx=495492  date=2019-03-12 14:35:20 progress=1.000000  cache=0
Still rescanning. At block 1159. Progress=1.000000
socket sending timeout: 61s

After a while, it seems to continue syncing and the CLI responds. Is this the normal behaviour? I mean, while it is catching up with all the blocks, is it normal for any CLI command to take time to respond?

Next, once subscribed, I try to search for something using the CLI command "liststreamkeys" and it replies, but I see the catch-up pause in the logs for a second or so, and then this message is thrown:

Timeout downloading block 0080e958ff3508948757d126b1f304b2c89599c375ec19fbe0f67f010d4ad522 from peer=19, disconnecting
mchn: Sending minimal parameter set to 10.138.132.7:53973
receive version message: /MultiChain:0.2.0.8/: version 70002, blocks=25749, us=10.138.132.8:8099, peer=20
mchn: Parameter set from peer=20 verified
ResendWalletTransactions()

This happens every time I send a command through the CLI while the node is still catching up.

Is this expected behaviour then? Should I just wait for the node to catch up completely before issuing commands to its API?

I also noticed that just after starting the new node up, it syncs blocks very quickly, but then slows down dramatically and the logs show lots of:

Timeout downloading block 009a95010f7a300c3889ebcb5a062fe0139163a96ffe4c49fd55a84ab937fe0e from peer=24, disconnecting
We couldn't reproduce any surprising performance problems using your uploaded data directory. Of course, some API calls take a long time, but these are expected behaviors, e.g. subscribing to a stream with many items, or making an API request that has to access a large amount of data from the disk.

The main issue is likely the fact that you are using network block storage instead of local disks (whether HDD or SSD). As with any database, some MultiChain operations have to make many disk accesses, and there will be a huge slowdown when using network storage. More tips on performance here:

https://www.multichain.com/developers/performance-optimization/
There has been a misunderstanding here: my latest tests, the ones I sent to you with a detailed step-by-step guide, have been performed using physical servers with local disks. No Kubernetes in the equation anymore.
That's understood. The pauses you are seeing while issuing API commands that take a long time to respond are normal, and these can in turn lead to block download timeouts. The subscribe command is a one-time operation, but for other APIs you should consider using the start/count parameters to prevent very long response times.
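For example, a key listing can be paged rather than fetched whole; a sketch of the paged payload (parameter order as documented for liststreamkeys: stream, keys, verbose, count, start):

```python
def paged_liststreamkeys(stream, count=100, start=-100):
    # keys="*" means all keys, verbose=False keeps the response small;
    # count/start page through the key list instead of returning it all at
    # once, and a negative start counts back from the end of the list.
    return {"method": "liststreamkeys",
            "params": [stream, "*", False, count, start],
            "id": "1"}

page = paged_liststreamkeys("ipregister")
```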
The queries I'm issuing are by key, so in theory it should not matter whether I use the start/count parameters, should it?

But still, is a 15-minute response time normal for a query by key in a stream on a particular node, when the master node always replies in milliseconds?
Thanks for your follow-up - we think we may have found the underlying issue for the scenario you're experiencing - more soon.
That's great news. Looking forward to your feedback!

I was also installing it on AWS using m5d.xlarge "dedicated" instances with SSD disks and facing exactly the same issues: it takes minutes to subscribe (without "-txindex"), with the following logs:

Still rescanning. At block 3936. Progress=1.000000
ping timeout: 60.033212s
Still rescanning. At block 7544. Progress=1.000000
Still rescanning. At block 11047. Progress=1.000000
Still rescanning. At block 14706. Progress=1.000000
Still rescanning. At block 18658. Progress=1.000000
Still rescanning. At block 22610. Progress=1.000000
mchn: Sending minimal parameter set to 172.31.18.75:48946
receive version message: /MultiChain:0.2.0.8/: version 70002, blocks=25749, us=172.31.29.148:8099, peer=9
mchn: Parameter set from peer=9 verified
mchn: Synced with node 9 on block 25749 - requesting mempool

And then, just after subscribing, I send a query by key and it replies immediately. Then I leave it running for a while, come back, issue a new query, and it takes some minutes to reply (while the same query replies immediately on the master).
More feedback...

Left the AWS instances running for some hours, and now when I query by key using the client node it again takes more than 15 minutes... while the master replies immediately...
Eventually everything was solved. Amazing support.
Just a comment for any others following this discussion – the performance issue was caused by the stream having ~100 different offchain data payloads, each of which appears 1000+ times in the stream. These heavily duplicated (and therefore identically hashed) stream items are an unusual usage pattern but we've now resolved the performance issue in this case. The fix will be in the 2.0 production release.
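To illustrate the pattern that triggered the issue: ~100 distinct payloads, each published 1000+ times, means the vast majority of off-chain items share their content hash with many others. A quick sketch (SHA-256 used here purely for illustration):

```python
import hashlib

# ~100 distinct payloads, each repeated 1000 times: 100,000 stream items
# that collapse to only 100 distinct content hashes.
payloads = [f"payload-{i}".encode() for i in range(100)] * 1000
digests = {hashlib.sha256(p).hexdigest() for p in payloads}
print(len(payloads), len(digests))  # 100000 items, 100 distinct hashes
```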