Not able to recover/start a node.

+1 vote

Hi Team,

I have a 2-peer MultiChain network, and I was running a few performance tests on it by generating 20 rps with the sendassetfrom API, all going to a single address. So my node was holding only 2 wallet addresses (watch-only), one for the sender and one for the receiver. The autocombine settings were the defaults, so my node was generating a lot of UTXOs.
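(For reference, autocombine behavior is controlled by runtime parameters that can be set in multichain.conf; the parameter names below are from the MultiChain runtime-parameter documentation, and the values are purely illustrative, not recommendations:)

```
# Illustrative multichain.conf autocombine settings (example values only)
autocombineminconf=1       # only combine outputs with at least this many confirmations
autocombinemaxinputs=100   # cap on outputs merged per autocombine transaction
autocombinedelay=1         # seconds to wait between autocombine attempts
```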

Eventually my node crashed, and since then I haven't been able to recover my genesis node, which held the admin permissions. My 2nd node is up, but it doesn't have admin permissions, so now I can't add other peers to my chain because I can't grant permissions from the 2nd node.

Initially my nodes were running on m4.large instances, and when the node crashed I tried to bring it back up by moving it to bigger instances. I even tried a c4.8xlarge for that chain, but the node never recovered.

When I started the daemon on the big machine, it basically hung forever. See the logs at the link below.

https://gist.github.com/ashish235/7ad9aaf8f0311577b6cf4d87754200ba 

Then I thought of reducing the UTXO count from the 2nd node, so I triggered a combine manually and brought the UTXO count down to just 18 or 19. Then I tried again to start the primary; this time, after a good 20 minutes, I got the message that the node had started, but the blocks never synced between the two nodes.

See the logs below. 

https://gist.github.com/ashish235/b85691a309d1bd10761ad49f1cc49cd3
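(For anyone trying the same thing: a manual combine can be triggered with the combineunspent API call. Something along these lines, where the arguments after the address filter are min confirmations, max combine transactions, min and max inputs per transaction, and a time limit in seconds; the exact values here are just an example, not what I actually ran:)

```
[ec2-user@ip-172-31-28-155 ~]$ multichain-cli chain1 combineunspent "*" 1 100 2 100 15
```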

Even then there were no transactions happening on the nodes; the block count on the 2nd node kept increasing and my primary node just couldn't catch up, always staying about 200 blocks behind. I left the system overnight and checked in the morning; the gap between the nodes' block heights was unchanged, and all I could see in the primary's logs were messages like the ones below. I got at least a million of these INVALID flag lines. When I checked on both nodes, this txid didn't exist; maybe it isn't a txid. :)

2017-11-28 19:21:59 Tx 35b807faa046809a3708f9fcea81b87ffab12a98bcbc2a43ddd00d4c1dec7934 was not accepted to mempool, setting INVALID flag
2017-11-28 19:21:59 Tx ecb81a6495211e3435a5cee6a11c3026fc943456e3752bbccd35f07ab1afaf2a was not accepted to mempool, setting INVALID flag
2017-11-28 19:21:59 Tx dd94407d874182acdc3176d54fd2377793b8bacb4c82a2a65715fd3b914aaaf0 was not accepted to mempool, setting INVALID flag
2017-11-28 19:21:59 Tx 5c733ed7c495d2e5396b0f38d6348dc35b570ce599534d14e96425caf8e4244e was not accepted to mempool, setting INVALID flag
2017-11-28 19:21:59 Tx 1d267fa07e6b1fe67670c955c3f11c44070e66467005f85f76f01eb82fa3352d was not accepted to mempool, setting INVALID flag
2017-11-28 19:21:59 Tx 8acf5bb55a082565625c3df9ca072ec9db5b74cb5331bb1233b8d80322474839 was not accepted to mempool, setting INVALID flag
2017-11-28 19:21:59 Tx 432a0f1c48453315b3d36bfb9defc822bc28c9b58f730023630c568ae4ee47d6 was not accepted to mempool, setting INVALID flag
2017-11-28 19:21:59 Tx 10852321116e5bcd25d2a7bf3b39aec262598f4226044ba90c7025b9b565d18d was not accepted to mempool, setting INVALID flag
2017-11-28 19:21:59 Tx f3ca1390b3faf35b4fe42a35c3521c0b796e454a3a5552f74fd63eda45e3d346 was not accepted to mempool, setting INVALID flag
mchn: Sending minimal parameter set to 172.31.17.70:35518
2017-11-28 19:21:59 receive version message: /MultiChain:0.1.0.9/: version 70002, blocks=4742, us=172.31.28.155:4411, peer=20
2017-11-28 19:21:59 mchn: Connection from 13zFHBxiYKbVBdefyiDG3A8EkqDUJa8xabRQLq received on peer=20 in verack
2017-11-28 19:21:59 mchn: Sending minimal parameter set to 172.31.17.70:4411
2017-11-28 19:21:59 receive version message: /MultiChain:0.1.0.9/: version 70002, blocks=4742, us=172.31.28.155:55716, peer=21
2017-11-28 19:21:59 mchn: Connection from 13zFHBxiYKbVBdefyiDG3A8EkqDUJa8xabRQLq received on peer=20 in verackack (172.31.17.70:35518)
2017-11-28 19:21:59 mchn: Parameter set from peer=20 verified
2017-11-28 19:21:59 ResendWalletTransactions()
In between all of this, one other strange thing happened during the heavy load testing that I think is worth mentioning: MultiChain once stopped responding, so I tried to stop it and it said:

[ec2-user@ip-172-31-28-155 ~]$ multichain-cli chain1 stop
{"method":"stop","params":[],"id":1,"chain_name":"chain1"}

error: no response from server

So I killed the daemon myself. 

Any pointers on how I can avoid such issues, and on how to get my primary node back? Let me know if I can provide any other information to help debug this.

Thanks.

asked Nov 30, 2017 by ashish235

1 Answer

0 votes

We think the issue is that there are a huge number of invalid unconfirmed transactions left over from previous forks, created while you were doing the performance testing that led to too many unspent outputs. Please try the following steps to fix the problem:

  1. Stop the problematic node (using the stop command in the API).
  2. Make a backup of the blockchain directory, in case something goes wrong in the next steps.
  3. Remove all files that match the pattern uncsend*.dat in the wallet subdirectory of the blockchain directory.
  4. Restart the node.
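The file-removal part of step 3 can be sketched as a small shell helper (clear_unconfirmed_sends is a hypothetical name, and the directory layout assumed is a default install where the blockchain directory is e.g. ~/.multichain/chain1):

```shell
# Hypothetical helper: remove the cached unconfirmed-send files from a
# node's wallet subdirectory, leaving wallet.dat and key files untouched.
# $1 is the blockchain directory, e.g. "$HOME/.multichain/chain1".
clear_unconfirmed_sends() {
  chaindir="$1"
  rm -f "$chaindir"/wallet/uncsend*.dat
}
```

Run this only after stopping the node (step 1) and backing up the blockchain directory (step 2).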

Please let us know if this works. If something goes wrong you can always stop the node and restore the backup that you made in step 2 above.

answered Dec 1, 2017 by MultiChain
Got the exact same problem in my performance tests after a while.
Restarting a node resulted in these error messages. I'll see whether the solution works for me.
...