Scaling blockchains with off-chain data

Posted June 13, 2018 by Gideon Greenspan in MultiChain, Private blockchains.

When a hash is worth a million words

By now it’s clear that many blockchain use cases have nothing to do with financial transactions. Instead, the chain’s purpose is to enable the decentralized aggregation, ordering, timestamping and archiving of any type of information, including structured data, correspondence or documentation. The blockchain’s core value is enabling its participants to provably and permanently agree on exactly what data was entered, when and by whom, without relying on a trusted intermediary. For example, SAP’s recently launched blockchain platform, which supports MultiChain and Hyperledger Fabric, targets a broad range of supply chain and other non-financial applications.

The simplest way to use a blockchain for recording data is to embed each piece of data directly inside a transaction. Every blockchain transaction is digitally signed by one or more parties, replicated to every node, ordered and timestamped by the chain’s consensus algorithm, and stored permanently in a tamper-proof way. Any data within the transaction will therefore be stored identically but independently by every node, along with a proof of who wrote it and when. The chain’s users are able to retrieve this information at any future time.

For example, MultiChain 1.0 allowed one or more named “streams” to be created on a blockchain and then used for storing and retrieving raw data. Each stream has its own set of write permissions, and each node can freely choose which streams to subscribe to. If a node is subscribed to a stream, it indexes that stream’s content in real-time, allowing items to be retrieved quickly based on their ordering, timestamp, block number or publisher address, as well as via a “key” (or label) by which items can be tagged. MultiChain 2.0 (since alpha 1) extended streams to support Unicode text or JSON data, as well as multiple keys per item and multiple items per transaction. It also added summarization functions such as “JSON merge” which combine items with the same key or publisher in a useful way.

Confidentiality and scalability

While storing data directly on a blockchain works well, it suffers from two key shortcomings – confidentiality and scalability. To begin with confidentiality, the content of every stream item is visible to every node on the chain, and this is not necessarily a desirable outcome. In many cases a piece of data should only be visible to a certain subset of nodes, even if other nodes are needed to help with its ordering, timestamping and notarization.

Confidentiality is a relatively easy problem to solve, by encrypting information before it is embedded in a transaction. The decryption key for each piece of data is only shared with those participants who are meant to see it. Key delivery can be performed on-chain using asymmetric cryptography (as described here) or via some off-chain mechanism, as is preferred. Any node lacking the key to decrypt an item will see nothing more than binary gibberish.

Scalability, on the other hand, is a more significant challenge. Let’s say that any decent blockchain platform should support a network throughput of 500 transactions per second. If the purpose of the chain is information storage, then the size of each transaction will depend primarily on how much data it contains. Each transaction will also need (at least) 100 bytes of overhead to store the sender’s address, digital signature and a few other bits and pieces.

If we take an easy case, where each item is a small JSON structure of 100 bytes, the overall data throughput would be 100 kilobytes per second, calculated from 500 × (100+100). This translates to under 1 megabit/second of bandwidth, which is comfortably within the capacity of any modern Internet connection. Data would accumulate at a rate of around 3 terabytes per year, which is no small amount. But with 12 terabyte hard drives now widely available, and RAID controllers which combine multiple physical drives into a single logical one, we could easily store 10-20 years of data on every node without too much hassle or expense.

However, things look very different if we’re storing larger pieces of information, such as scanned documentation. A reasonable quality JPEG scan of an A4 sheet of paper might be 500 kilobytes in size. Multiply this by 500 transactions per second, and we’re looking at a throughput of 250 megabytes per second. This translates to 2 gigabits/second of bandwidth, which is faster than most local networks, let alone connections to the Internet. At Amazon Web Services’ cheapest published price of $0.05 per gigabyte, it means an annual bandwidth bill of $400,000 per node. And where will each node store the 8000 terabytes of new data generated annually?

It’s clear that, for blockchain applications storing many large pieces of data, straightforward on-chain storage is not a practical choice. To add insult to injury, if data is encrypted to solve the problem of confidentiality, nodes are being asked to store a huge amount of information that they cannot even read. This is not an attractive proposition for the network’s participants.

The hashing solution

So how do we solve the problem of data scalability? How can we take advantage of the blockchain’s decentralized notarization of data, without replicating that data to every node on the chain?

The answer is with a clever piece of technology called a “hash”. A hash is a long number (think 256 bits, or around 80 decimal digits) which uniquely identifies a piece of data. The hash is calculated from the data using a one-way function which has an important cryptographic property: Given any piece of data, it is easy and fast to calculate its hash. But given a particular hash, it is computationally infeasible to find a piece of data that would generate that hash. And when we say “computationally infeasible”, we mean more calculations than there are atoms in the known universe.

Hashes play a crucial role in all blockchains, by uniquely identifying transactions and blocks. They also underlie the computational challenge in proof-of-work systems like bitcoin. Many different hash functions have been developed, with gobbledygook names like BLAKE2, MD5 and RIPEMD160. But in order for any hash function to be trusted, it must endure extensive academic review and testing. These tests come in the form of attempted attacks, such as “preimage” (finding an input with the given hash), “second preimage” (finding a second input with the same hash as the given input) and “collision” (finding any two different inputs with the same hash). Surviving this gauntlet is far from easy, with a long and tragic history of broken hash functions proving the famous maxim: “Don’t roll your own crypto.”

To go back to our original problem, we can solve data scalability in blockchains by embedding the hashes of large pieces of data within transactions, instead of the data itself. Each hash acts as a “commitment” to its input data, with the data itself being stored outside of the blockchain or “off-chain”. For example, using the popular SHA256 hash function, a 500 kilobyte JPEG image can be represented by a 32-byte number, a reduction of over 15,000×. Even at a rate of 500 images per second, this puts us comfortably back in the territory of feasible bandwidth and storage requirements, in terms of the data stored on the chain itself.

Of course, any blockchain participant that needs an off-chain image cannot reproduce it from its hash. But if the image can be retrieved in some other way, then the on-chain hash serves to confirm who created it and when. Just like regular on-chain data, the hash is embedded inside a digitally signed transaction, which was included in the chain by consensus. If an image file falls out of the sky, and the hash for that image matches a hash in the blockchain, then the origin and timestamp of that image is confirmed. So the blockchain is providing exactly the same value in terms of notarization as if the image was embedded in the chain directly.

A question of delivery

So far, so good. By embedding hashes in a blockchain instead of the original data, we have an easy solution to the problem of scalability. Nonetheless, one crucial question remains:

How do we deliver the original off-chain content to those nodes which need it, if not through the chain itself?

This question has several possible answers, and we know of MultiChain users applying them all. One basic approach is to set up a centralized repository at some trusted party, where all off-chain data is uploaded then subsequently retrieved. This system could naturally use “content addressing”, meaning that the hash of each piece of data serves directly as its identifier for retrieval. However, while this setup might work for a proof-of-concept, it doesn’t make sense for production, because the whole point of a blockchain is to remove trusted intermediaries. Even if on-chain hashes prevent the intermediary from falsifying data, it could still delete data or fail to deliver it to some participants, due to a technical failure or the actions of a rogue employee.

A more promising possibility is point-to-point communication, in which the node that requires some off-chain data requests it directly from the node that published it. This avoids relying on a trusted intermediary, but suffers from three alternative shortcomings:

It requires a map of blockchain addresses to IP addresses, to enable the consumer of some data to communicate directly with its publisher. Blockchains can generally avoid this type of static network configuration, which can be a problem in terms of failover and privacy.
If the original publisher node has left the network, or is temporarily out of service, then the data cannot be retrieved by anyone else.
If a large number of nodes are interested in some data, then the publisher will be overwhelmed by requests. This can create severe network congestion, slow the publisher’s system down, and lead to long delays for those trying to retrieve that data.

In order to avoid these problems, we’d ideally use some kind of decentralized delivery mechanism. Nodes should be able to retrieve the data they need without relying on any individual system – be it a centralized repository or the data’s original publisher. If multiple parties have a piece of data, they should share the burden of delivering it to anyone else who wants it. Nobody needs to trust an individual data source, because on-chain hashes can prove that data hasn’t been tampered with. If a malicious node delivers me the wrong data for a hash, I can simply discard that data and try asking someone else.

For those who have experience with peer-to-peer file sharing protocols such as Napster, Gnutella or BitTorrent, this will all sound very familiar. Indeed, many of the basic principles are the same, but there are two key differences. First, assuming we’re using our blockchain in an enterprise context, the system runs within a closed group of participants, rather than the Internet as a whole. Second, the blockchain adds a decentralized ordering, timestamping and notarization backbone, enabling all users to maintain a provably consistent and tamper-resistant view of exactly what happened, when and by whom.

How might a blockchain application developer achieve this decentralized delivery of off-chain content? One common choice is to take an existing peer-to-peer file sharing platform, such as the amusingly-named InterPlanetary File System (IPFS), and use it together with the blockchain. Each participant runs both a blockchain node and an IPFS node, with some middleware coordinating between the two. When publishing off-chain data, this middleware stores the original data in IPFS, then creates a blockchain transaction containing that data’s hash. To retrieve some off-chain data, the middleware extracts the hash from the blockchain, then uses this hash to fetch the content from IPFS. The local IPFS node automatically verifies the retrieved content against the hash to ensure it hasn’t been changed.

While this solution is possible, it’s all rather clumsy and inconvenient. First, every participant has to install, maintain and update three separate pieces of software (blockchain node, IPFS node and middleware), each of which stores its data in a separate place. Second, there will be two separate peer-to-peer networks, each with its own configuration, network ports, identity system and permissioning (although it should be noted that IPFS doesn’t yet support closed networks). Finally, tightly coupling IPFS and the blockchain together would make the middleware increasingly complex. For example, if we want the off-chain data referenced by some blockchain transactions to be instantly retrieved (with automatic retries), the middleware would need to be constantly up and running, maintaining its own complex state. Wouldn’t it be nice if the blockchain node did all of this for us?

Off-chain data in MultiChain 2.0

Today we’re delighted to release the third preview version (alpha 3) of MultiChain 2.0, with a fully integrated and seamless solution for off-chain data. Every piece of information published to a stream can be on-chain or off-chain as desired, and MultiChain takes care of everything else.

No really, we mean everything. As a developer building on MultiChain, you won’t have to worry about hashes, local storage, content discovery, decentralized delivery or data verification. Here’s what happens behind the scenes:

The publishing MultiChain node writes the new data in its local storage, slicing large items into chunks for easy digestion and delivery.
The transaction for publishing off-chain stream items is automatically built, containing the chunk hash(es) and size(s) in bytes.
This transaction is signed and broadcast to the network, propagating between nodes and entering the blockchain in the usual way.
When a node subscribed to a stream sees a reference to some off-chain data, it adds the chunk hashes for that data to its retrieval queue. (When subscribing to an old stream, a node also queues any previously published off-chain items for retrieval.)
As a background process, if there are chunks in a node’s retrieval queue, queries are sent out to the network to locate those chunks, as identified by their hashes.
These chunk queries are propagated to other nodes in the network in a peer-to-peer fashion (limited to two hops for now – see technical details below).
Any node which has the data for a chunk can respond, and this response is relayed to the subscriber back along the same path as the query.
If no node answers the chunk query, the chunk is returned back to the queue for later retrying.
Otherwise, the subscriber chooses the most promising source for a chunk (based on hops and response time), and sends it a request for that chunk’s data, again along the same peer-to-peer path as the previous response.
The source node delivers the data requested, using the same path again.
The subscriber verifies the data’s size and hash against the original request.
If everything checks out, the subscriber writes the data to its local storage, making it immediately available for retrieval via the stream APIs.
If the requested content did not arrive, or didn’t match the desired hash or size, the chunk is returned back to the queue for future retrieval from a different source.

Most importantly, all of this happens extremely quickly. In networks with low latency, small pieces of off-chain data will arrive at subscribers within a split second of the transaction that references them. And for high load applications, our testing shows that MultiChain 2.0 alpha 3 can sustain a rate of over 1000 off-chain items or 25 MB of off-chain data retrieved per second, on a mid-range server (Core i7) with a decent Internet connection. Everything works fine with off-chain items up to 1 GB in size, far beyond the 64 MB limit for on-chain data. Of course, we hope to improve these numbers further as we spend time optimizing MultiChain 2.0 during its beta phase.

When using off-chain rather than on-chain data in streams, MultiChain application developers have to do exactly two things:

When publishing data, pass an “offchain” flag to the appropriate APIs.
When using the stream querying APIs, consider the possibility that some off-chain data might not yet be available, as reported by the “available” flag. While this situation will be rare under normal circumstances, it’s important for application developers to handle it appropriately.

Of course, to prevent every node from retrieving every off-chain item, items should be grouped together into streams in an appropriate way, with each node subscribing to those streams of interest.

On-chain and off-chain items can be used within the same stream, and the various stream querying and summarization functions relate to both types of data identically. This allows publishers to make the appropriate choice for every item in a stream, without affecting the rest of an application. For example, a stream of JSON items about people’s activities might use off-chain data for personally identifying information, and on-chain data for the rest. Subscribers can use MultiChain’s JSON merging to combine both types of information into a single JSON for reading.

If you want to give off-chain stream items a try, just follow MultiChain’s regular Getting Started tutorial, and be sure not to skip section 5.

So what’s next?

With seamless support for off-chain data, MultiChain 2.0 will offer a big step forwards for blockchain applications focused on large scale data timestamping and notarization. In the longer term, we’re already thinking about a ton of possible future enhancements to this feature for the Community and/or Enterprise editions of MultiChain:

Implementing stream read permissions using a combination of off-chain items, salted hashes, signed chunk queries and encrypted delivery.
Allowing off-chain data to be explicitly “forgotten”, both voluntarily by individual nodes, or by all nodes in response to an on-chain message.
Selective stream subscriptions, in which nodes only retrieve the data for off-chain items with particular publishers or keys.
Using merkle trees to enable a single on-chain hash to represent an unlimited number of off-chain items, giving another huge jump in terms of scalability.
Pluggable storage engines, allowing off-chain data to be kept in databases or external file systems rather than local disk.
Nodes learning over time where each type of off-chain data is usually available in a network, and focusing their chunk queries appropriately.

We’d love to hear your feedback on the list above as well as off-chain items in general. With MultiChain 2.0 still officially in alpha, there’s plenty of time to enhance this feature before its final release.

In the meantime, we’ve already started work on “Smart Filters”, the last major feature planned for MultiChain 2.0 Community. A Smart Filter is a piece of code embedded in the blockchain which implements custom rules for validating data or transactions. Smart Filters have some similarities with “smart contracts”, and can do many of the same things, but have key differences in terms of safety and performance. We look forward to telling you more in due course.

Please post any comments on LinkedIn.

Technical details

While off-chain stream items in MultiChain 2.0 are simple to use, they contain many design decisions and additional features that may be of interest. The list below will mainly be relevant for developers building blockchain applications, and can be skipped by less technical types:

Per-stream policies. When a MultiChain stream is created, it can optionally be restricted to allow only on-chain or off-chain data. There are several possible reasons for doing this, rather than allowing each publisher to decide for themselves. For example, on-chain items offer an ironclad availability guarantee, whereas old off-chain items may become irretrievable if their publisher and other subscribers drop off the network. On the flip side, on-chain items cannot be “forgotten” without modifying the blockchain, while off-chain items are more flexible. This can be important in terms of data privacy rules, such as Europe’s new GDPR regulations.
On-chain metadata. For off-chain items, the on-chain transaction still contains the item’s publisher(s), key(s), format (JSON, text or binary) and total size. All this takes up very little space, and helps application developers determine whether the unavailability of an off-chain item is of concern for a particular stream query.
Two-hop limit. When relaying chunk queries across the peer-to-peer network, there is a trade-off between reachability and performance. While it would be nice for every query to be propagated along every single path, this can clog the network with unnecessary “chatter”. So for now chunk queries are limited to two hops, meaning that a node can retrieve off-chain data from any peer of its peers. In the smaller networks of under 1000 nodes that tend to characterize enterprise blockchains, we believe this will work just fine, but it’s easy for us to adjust this constraint (or offer it as a parameter) if we turn out to be wrong.
Local storage. Each MultiChain node stores off-chain data within the “chunks” directory of its regular blockchain directory, using an efficient binary format and LevelDB index. A separate subdirectory is used for the items in each of the subscribed streams, as well as those published by the node itself. Within each of these subdirectories, duplicate chunks (with the same hash) are only stored once. When a node unsubscribes from a stream, it can choose whether or not to purge the off-chain data retrieved for that stream.
Binary cache. When publishing large pieces of binary data, whether on-chain or off-chain, it may not be practical for application developers to send that data to MultiChain’s API in a single JSON-RPC request. So MultiChain 2.0 implements a binary cache, which enables large pieces of data to be built up over multiple API calls, and then published in a brief final step. Each item in the binary cache is stored as a simple file in the “cache” subdirectory of the blockchain directory, allowing gigabytes of data to also be pushed directly via the file system.
Monitoring APIs. MultiChain 2.0 alpha 3 adds two new APIs for monitoring the asynchronous retrieval of off-chain data. The first API describes the current state of the queue, showing how many chunks (and how much data) are waiting or being queried or retrieved. The second API provides aggregate statistics for all chunk queries and requests sent since the node started up, including counts of different types of failure.
Flush on publish. When publishing an off-chain item, MultiChain ensures that its local copy of the data is fully written (or “flushed”) to the physical disk drive before the transaction referencing that data is broadcast to the network. Otherwise, if the node was unlucky enough to lose power immediately after broadcasting the transaction, the off-chain data could be permanently lost. This isn’t an issue for MultiChain itself, since the delays between a chunk’s retrieval attempts grow automatically over time. But it could cause problems at the application level, where everyone knows of the existence of some data but nobody is able to find it.
Publishing performance. By flushing off-chain data to disk in this way, MultiChain can incur a performance penalty, since physical disks are slow. For example, a mid-range 7200 rpm hard drive can only perform around 100 random data writes per second, limiting in turn the rate at which an individual node can publish off-chain items. There are three possible workarounds for this problem. First and most simply, nodes can use a solid state device (SSD) drive instead of a regular hard drive, which supports 10,000s of random write operations per second. Second, multiple off-chain items can be published in a single transaction using the “createrawsendfrom” API. In this case, MultiChain writes all the off-chain data referenced by a transaction in a single disk operation. Finally, MultiChain can be configured not to flush off-chain data to disk before broadcasting the transaction which references it. Use this option with care.
Native currency integration. For use cases which require it, MultiChain has always offered the option of using a native currency on a blockchain to prevent transaction spam and/or incentivize block validators (“miners”). In these cases, transactions must offer miners a minimum fee which is proportional to their size in bytes, in order to be relayed and confirmed on the chain. This mechanism has been extended to allow off-chain spam to be prevented, by requiring a minimum additional fee per kilobyte of off-chain data referenced in a transaction.
Archive nodes. If a node wishes to subscribe to every stream, and therefore retrieve and store every off-chain item published, it can be configured to do so using the “autosubscribe” runtime parameter. Any such node will act as a backup for the entire network, guaranteeing that off-chain items will not be lost or unavailable, no matter which other nodes disappear. One can imagine third party companies offering this as a commercial service.

Full details of all the relevant API calls and parameters can be found on the MultiChain 2.0 preview page.

Scaling blockchains with off-chain data

Confidentiality and scalability

The hashing solution

A question of delivery

Off-chain data in MultiChain 2.0

So what’s next?

Technical details

Recent Posts

Categories

Archives

RSS feed