Read restrictions, partial indexing, selective retrieval and purging

MultiChain’s streams functionality provides extensive support for blockchain use cases which focus on data storage, timestamping and retrieval. Stream items are published within blockchain transactions, creating an append-only tamper-proof database in which every item is ordered, timestamped and has a known author.

In both MultiChain Community and Enterprise, each stream item can contain JSON, text or binary data, up to 64 MB in size (on-chain) or 1 GB (off-chain), with one or more labels (“keys”) for easy retrieval. Each stream has its own permissions, and can have custom JavaScript rules for data validation (stream filters). For more information about streams, including the difference between on-chain and off-chain data, see the MultiChain documentation on streams.
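
For example, the same piece of information could be published as JSON, as text or as binary data (expressed in hexadecimal). The stream name and keys below are placeholders for illustration, and assume a stream that already exists and is writable by the publishing address:

publish stream1 key1 '{"json":{"name":"John","city":"London"}}'   (JSON data)
publish stream1 key1 '{"text":"John, London"}'   (text data)
publish stream1 key1 4a6f686e2c204c6f6e646f6e   (binary data, as hexadecimal)
publish stream1 '["key1","key2"]' '{"text":"John, London"}'   (the same item under two keys)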

MultiChain Enterprise offers a number of additional features relating to streams:

  • Per-stream read permissions, to control which nodes can see which data. In a read-permissioned stream, all data is off-chain.
  • Real-time replication of stream data to an external database via feeds.
  • End-to-end encryption of the delivery of read-restricted data over the peer-to-peer network.
  • Selective stream indexing, to control how a node indexes each stream.
  • Selective data retrieval, to control which off-chain items a node retrieves.
  • Selective purging of off-chain data, for compliance with right-to-be-forgotten regulations.

The first two items are introduced elsewhere. Per-stream read permissions are explained in the Getting Started guide and external database replication is covered by the Feed Adapter documentation. The third item, end-to-end encryption, takes place automatically and requires no user action. So this tutorial will focus on the last three items – selective stream indexing, selective data retrieval and purging of off-chain data.

The tutorial requires two servers running MultiChain Enterprise or Enterprise Demo, both of which are connected to the same blockchain. Both servers should be running multichain-cli for that chain in interactive mode. If you don’t yet have this set up, download and install MultiChain Enterprise Demo on two servers, follow the instructions in sections 1 and 2 of the Getting Started guide and then run multichain-cli chain1 on both servers. The first server’s node should also have an address with admin and create permissions – this will automatically be the case if it started the chain.
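
As a quick check on the first server, the following command lists the addresses holding admin or create permissions, one of which should belong to that node:

listpermissions admin,create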

Selective indexing

Let’s begin by creating a stream which is open to all for writing, using the node which has create permissions:

create stream stream3 true

Now, let’s publish two items to this stream, where the data is on-chain, i.e. fully embedded as part of each transaction:

publish stream3 key1 '{"json":{"name":"John","city":"London"}}'
publish stream3 key2 '{"json":{"name":"Mary","city":"New York"}}'

If a node tries to retrieve data from a stream before subscribing to it, it will receive an error. To see an example, try this on the second server:

liststreamitems stream3

So let’s now subscribe to the stream, but index it in the most minimal way possible, only tracking the ordering of the items on the blockchain:

subscribe stream3 true items
liststreams stream3

The second command shows that only the items index is active. This allows the most basic stream queries, for example:

liststreamitems stream3

However, there are many other stream querying commands that will give an error, for example:

liststreamkeys stream3   (listing all keys in the stream)
liststreamkeyitems stream3 key1   (listing all items for this key)
liststreampublishers stream3   (listing all publishers)
liststreamitems stream3 false 10 0 true   (listing items in order first seen by this node)

So let’s now add an additional index for this stream:

subscribe stream3 true keys
liststreams stream3

The second command shows that both the items and keys indexes are now active, so the first two of the previous queries will work fine:

liststreamkeys stream3
liststreamkeyitems stream3 key1

However, the other two querying commands still don’t work, since they would require the publishers and items-local indexes respectively:

liststreampublishers stream3
liststreamitems stream3 false 10 0 true

So selective indexing has an obvious disadvantage – it only allows streams to be queried in certain ways. Nonetheless, this comes with a corresponding advantage – faster transaction processing and lower disk usage. So if you know in advance how your application will need to query a stream, you can use selective indexing to improve MultiChain’s performance and reduce its system requirements.

Now let’s stop building the keys index for the stream, using the following command on the second server:

trimsubscribe stream3 keys

If we try running a key-related query again, we’ll be back to receiving an error:

liststreamkeys stream3

So indexes are added using the subscribe command, and removed using trimsubscribe. Below is the full list of indexes a node can build for a stream, along with the corresponding APIs which require each index to be present:

  • items – required for every subscription and used by liststreamitems.
  • keys – used by liststreamkeys, liststreamkeyitems and getstreamkeysummary.
  • publishers – used by liststreampublishers, liststreampublisheritems and getstreampublishersummary.
  • items-local – used by liststreamitems if local ordering is specified.
  • keys-local – used by liststreamkeys and liststreamkeyitems if local ordering is specified.
  • publishers-local – used by liststreampublishers and liststreampublisheritems if local ordering is specified.
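
Several indexes can also be requested in a single subscribe call, by passing them as a comma-separated list in the third parameter, for example (for illustration only – there’s no need to run this for the tutorial):

subscribe stream3 true items,keys,publishers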

Selective retrieval and purging

Now let’s publish two off-chain items, where the blockchain only contains the key(s) and a hash, while the data itself is stored and delivered separately. On the first server:

publish stream3 '["key1","key2"]' '{"json":{"name":"Alex","city":"Tokyo"}}' offchain
publish stream3 key3 '{"json":{"name":"Sam","city":"Toronto"}}' offchain

While we’re on the first server, let’s also subscribe to the stream, indexing items only:

subscribe stream3 true items

Now let’s take a look at the two most recently published items in the stream, on both servers:

liststreamitems stream3 false 2

Note that, in both cases, the publishers and keys of the item are shown, and the items are labelled as offchain. However, on the first server, the data is available and can be seen in the output, whereas on the second server the data is unavailable.

The reason for this difference is as follows: The off-chain data was published on the first server, which stored a local copy of that data at the time of publication. Therefore, this data is available to the first server when querying the stream. However, the second server has not yet retrieved the off-chain data, so it is not available when querying.

Now, if the second server had subscribed to the stream with a regular non-selective subscribe stream3 command, it would automatically retrieve every stream item’s off-chain data from the network as soon as each item is published. The same would happen if it had used the parameter items,retrieve instead of just items in the subscribe command. However, because it created a selective subscription which excluded the retrieve keyword, the node only retrieves off-chain data when explicitly told to do so.
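
For reference, this is how automatic retrieval could have been enabled at subscription time (there’s no need to run it now if you want to follow the manual retrieval steps below):

subscribe stream3 true items,retrieve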

Let’s confirm the automatic retrieval status of the stream subscription on the second server, noting the retrieve value shown:

liststreams stream3

So first let’s manually retrieve one of the missing off-chain items, by matching on the key key1. On the second server:

subscribe stream3 true keys   (to build the index by key)
retrievestreamitems stream3 '{"key":"key1"}'

This command returns the number of items that matched the given criterion, as well as their total size and the number of chunks (blocks of data) they contain. As explained in the API documentation, the items to be retrieved can also be specified by publisher (if the required index is present), transaction ID, block range or timestamp.
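
For illustration, retrieval by publisher might look something like the commands below. Note that the publisher criterion and the address shown are assumptions made by analogy with the key criterion above, so check the retrievestreamitems documentation for the exact syntax before relying on it:

subscribe stream3 true publishers   (to build the index by publisher)
retrievestreamitems stream3 '{"publisher":"1ABC..."}'   (substituting a real publisher address)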

Now let’s confirm that the off-chain data for this item (with key key1) is now available on the second server, whereas the off-chain data for the second item (with key key3) is still not:

liststreamitems stream3 false 2

Let’s say we now want the second server to forget the off-chain data again. We can do so as follows:

purgestreamitems stream3 '{"key":"key1"}'

Again, the command returns the number of items that matched the given criterion, as well as the total chunks and size of data that were purged. If you run the command again, you’ll see that there’s still one match, but nothing more to be purged. Now we can check that the data is unavailable again:

liststreamitems stream3 false 2

Note that the data has also been scrubbed from this node’s disk storage, using the wiping method controlled by the purgemethod runtime parameter (by default, overwriting with zeroes).
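
Like other runtime parameters, purgemethod can be set in the blockchain’s multichain.conf file or passed on the command line when starting the node. The value below is just a placeholder, so consult the runtime parameters documentation for the permitted settings:

multichaind chain1 -purgemethod=<method> -daemon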

Now let’s configure the second server to automatically retrieve the off-chain data for all new published items going forwards:

subscribe stream3 true retrieve

Now on the first server, let’s publish a new off-chain item:

publish stream3 key4 '{"json":{"name":"Jamie","city":"Mumbai"}}' offchain

Let’s see what’s going on, back on the second server:

liststreamitems stream3 false 3

Note that it has automatically retrieved the off-chain data for the most recently published stream item. However it didn’t retrieve the data for the previous items, which were published before the subscription was set to auto-retrieve. To retrieve the off-chain data for those items, we would still need to use the retrievestreamitems command.

Until now, a copy of the off-chain data has always remained on the first server, where it was published. If we want to forget some data from the network completely and permanently, we need to do two things. First, every node which has retrieved the data from the network must purge its local copy, using purgestreamitems on that node. Second, the data must be removed from the node which originally published it. As a matter of design safety, purgestreamitems has no effect on the publishing node, so this requires a different command, purgepublisheditems.

To see this in action, run the following on either node to identify the transaction ID of the stream item with key3:

liststreamitems stream3 false 1 -2

Copy the txid shown in the output – it will be needed below.

Now, on the first server, run the following command, substituting the txid you copied:

purgepublisheditems txid

As before, the command returns the number of items that matched the given criterion, as well as the total chunks and size of data that were purged. As explained in the API documentation, the items to be purged can also be specified by block range, timestamp, or all to purge everything published by this node.
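
For example, the following command would purge the off-chain data of every item this node has ever published (don’t run it now, since it would also remove the key4 item published earlier):

purgepublisheditems all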

Now, on both servers, we can see that the data is no longer available:

liststreamitems stream3 false 1 -2

We can try retrieving the data for this stream item using the following command on both nodes, again substituting the txid:

retrievestreamitems stream3 txid

But it won’t help, since the data is completely lost. We can even see that each node continues to try retrieving it from the network, without success:

getchunkqueueinfo

(To prevent overload, MultiChain uses exponential backoff to gradually reduce the rate at which each piece of data is sought.)
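
There’s also a related getchunkqueuetotals command, which summarizes the queue of off-chain data still waiting to be retrieved:

getchunkqueuetotals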