- Problem
- Why is it interesting? / Applications / Motivations
- Data availability proof using erasure codes
- Related Works
- References
Author: @Irene Date: January 2022
Problem
How can nodes with a partial view of a blockchain (eg, “light clients” downloading only the block header) be sure that when a new block is produced, all of the data (ie, transactions) in that block was actually published to the network?
Why is it interesting? / Applications / Motivations
Solving the data availability problem is needed to enable fraud proofs (which are required in different scalability solutions such as shards and optimistic rollups)
”If a block producer doesn’t release all of the data in a block, no one could detect if there is a malicious transaction hidden within that block and produce the corresponding fraud proof”
and more in general in any application that uses a blockchain as a data layer (eg, celestia/LazyLedger).
“Reducing the problem of validating the blockchain to simply verifying that the contents of the block are available”.
A Proof of Data Availability is an interactive protocol among light clients and full nodes.
NETWORK ASSUMPTIONS:
- full nodes communicate with each other, light clients communicate only with full nodes.
- no honest majority of consensus-participating nodes is required
- each light client is connected to at least one honest full node that is online
- [a minimum number of honest light clients is needed in order to reconstruct the block]
PROPERTIES:
- soundness (“if an honest client accepts, an honest full node has the block”)
- agreement (“if an honest client accepts, all the other honest clients accept”)
Data availability proof using erasure codes
Solution based on erasure codes. The high-level idea is
- The block producer encodes the block data D of size k with an erasure code with reception efficiency q, getting E = enc(D) of size n > k, and computes the MT (Merkle tree) root on E.
- Note that, by definition of the erasure code’s reception efficiency, any q*n shares of E suffice to reconstruct the entire D.
- Before accepting a block, each light client downloads s (a constant number of) random components of E. If it can download all of those components and they open correctly to the MT root, then with probability at least 1−q^s there is actually enough data on the network to reconstruct the whole block (a probability sketch follows this list). [The light client gossips the sampled shares to the full nodes it is connected to.]
- Assumption: there are enough honest light clients making sample requests so that enough shares are released for reconstruction (see the “selective share disclosure” attack below).
Possible research direction #1: remove/weaken this assumption?
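A quick sanity check of the 1−q^s bound, under the simplifying assumption that when fewer than q*n shares were published, each uniformly random sample independently hits an available share with probability at most q (toy values only):

```python
# Probability that a light client accepts an unrecoverable block after s samples.
# Simplified model: if fewer than q*n shares were published, each uniformly random
# sample hits an available share with probability at most q (independence assumed).

def false_accept_bound(q: float, s: int) -> float:
    """Upper bound on the probability that all s sampled shares are available
    even though the block cannot be reconstructed."""
    return q ** s

def detection_probability(q: float, s: int) -> float:
    """Lower bound on the probability that at least one sample fails,
    i.e. the light client detects unavailability."""
    return 1 - false_accept_bound(q, s)

if __name__ == "__main__":
    # Example: Reed-Solomon with n = 2k, i.e. reception efficiency q = 1/2.
    for s in (5, 10, 15, 20, 30):
        print(f"s = {s:2d}  ->  detection probability >= {detection_probability(0.5, s):.6f}")
```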
“Selective Share Disclosure Attack”: A block producer selectively releases shares as light clients ask for them, up to q*n-1; clients will accept the blocks as available despite them being unrecoverable. From the paper: ”This is alleviated assuming an enhanced network model where a sufficient number of honest light clients make requests such that more than q*n -1 shares will be sampled, and that each sample request for each share is anonymous (i.e., sample requests cannot be linked to the same client) and the distribution in which every sample request is received is uniformly random, for example by using a mix net [9]. As the network would not be able to link different per-share sample requests to the same clients, shares cannot be selectively released on a per-client basis.”
As erasure code, a Reed-Solomon code is used. Parse the data D into k blocks, each block being a field element in F_p for some prime p. Compute the polynomial P of degree k−1 such that P(i) = D_i for i = 0, 1, ..., k−1 (via Lagrange interpolation or FFT). E is defined as (D || P(k), ..., P(n−1)), which requires p ≥ n. With any set of k blocks from E, we can fully recover D. (For example, n = 2k gives q = 1/2.)
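A minimal runnable sketch of this Reed-Solomon extension over a small prime field (p, k, n, and the sample data below are toy choices, not parameters from [1]):

```python
# Toy Reed-Solomon extension over F_p.
# D is parsed into k field elements; P is the unique degree-(k-1) polynomial
# with P(i) = D[i] for i = 0..k-1; the codeword is E = (P(0), ..., P(n-1)).

p = 65537          # prime modulus, must satisfy p >= n
k, n = 4, 8        # n = 2k gives reception efficiency q = 1/2

def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial through the given (xi, yi) points, mod p."""
    acc = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        acc = (acc + yi * num * pow(den, -1, p)) % p
    return acc

def rs_encode(data):
    """Extend k data blocks to n codeword blocks; positions 0..k-1 are systematic."""
    assert len(data) == k
    pts = list(enumerate(data))
    return [lagrange_eval(pts, x) for x in range(n)]

def rs_decode(shares):
    """Recover the k data blocks from any k (index, value) pairs."""
    assert len(shares) >= k
    pts = shares[:k]
    return [lagrange_eval(pts, x) for x in range(k)]

if __name__ == "__main__":
    D = [12, 345, 6789, 1011]                  # toy data blocks
    E = rs_encode(D)                           # E = (D || P(k), ..., P(n-1))
    # Any k shares of E suffice to recover D, e.g. the last k:
    recovered = rs_decode(list(enumerate(E))[-k:])
    assert recovered == D
    print("codeword:", E)
```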
There is a problem with the basic protocol above: the block producer could compute a wrong E. A simple but inefficient solution is an optimistic approach in which honest full nodes send a fraud proof to light clients if something goes wrong. This is inefficient because it requires light clients to download the entire block in case of a fraud proof. A more efficient solution is proposed by [1] using two-dimensional erasure codes. Their protocol goes like this:
- The block producer encodes the block data of size k x k with a two-dimensional code, getting a 2k x 2k matrix E, and computes the MT root on E.
- MT root on E:
- first compute rowRoot_i and columnRoot_i for all i
- root = MT root on (rowRoot_1, ..., rowRoot_2k, columnRoot_1, ..., columnRoot_2k)
- The light client receives the block header and rowRoot_1, ..., rowRoot_2k, columnRoot_1, ..., columnRoot_2k, and checks that the MT root computed over these equals the dataRoot in the header (see the sketch after this list)
- The light client chooses a set of matrix entries (ie, coordinates (x,y)) and asks for those entries and their inclusion proofs. Each share with a valid MT proof that is received by the light client is gossiped to all the full nodes that the light client is connected to;
- If a full node has enough shares to recover a particular row or column, and after doing so detects that recovered data does not match its respective row or column root, then it must distribute a fraud proof consisting of enough shares in that row or column to be able to recover it, and a Merkle proof for each share.
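A minimal sketch of this commitment structure with SHA-256 and a plain binary Merkle tree (leaf/node domain separation and the erasure coding itself are omitted; the share contents below are placeholders):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Plain binary Merkle root (duplicates the last node on odd levels)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def data_root(E):
    """Merkle root over all row roots followed by all column roots of the share matrix E."""
    rows, cols = len(E), len(E[0])
    row_roots = [merkle_root(E[r]) for r in range(rows)]
    col_roots = [merkle_root([E[r][c] for r in range(rows)]) for c in range(cols)]
    return merkle_root(row_roots + col_roots), row_roots, col_roots

if __name__ == "__main__":
    two_k = 4  # toy 2k; E would be the erasure-coded 2k x 2k share matrix
    E = [[f"share-{r}-{c}".encode() for c in range(two_k)] for r in range(two_k)]
    root, row_roots, col_roots = data_root(E)
    # A light client holding the header recomputes the same root from the
    # advertised row/column roots before sampling shares.
    assert root == merkle_root(row_roots + col_roots)
    print("dataRoot =", root.hex())
```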
Research direction #2: other ways to efficiently prove correct encoding?
- Note: the original paper [1] suggests looking at SNARKs and at locally decodable codes + proximity checks.
- [2] proposes a new protocol based on KZG commitments and RS codes that eliminates the need for invalid-encoding fraud proofs. See the section below (Related Works) for more info.
Two-dimensional Reed-Solomon Code
- Parse the data D into k^2 blocks and organize them in a k x k matrix (apply padding if necessary);
- Apply Reed-Solomon encoding with n = 2k to each row;
- Apply Reed-Solomon encoding with n = 2k to each column;
- Apply Reed-Solomon encoding to the “extended” rows;
- The result, E, is a 2k x 2k matrix.
Theorem: Given E, data is unrecoverable if at least k + 1 columns or rows have each at least k + 1 unavailable shares. In that case, the minimum number of shares that must be unavailable is (k + 1)^2.
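A worked instance of the theorem (k and the sample counts s are toy values; each sample is assumed uniform over the (2k)^2 extended shares):

```python
# Minimum withholding needed to make a 2k x 2k extended block unrecoverable,
# and the resulting probability that a sampling light client catches it.

def min_unavailable(k: int) -> int:
    return (k + 1) ** 2            # from the theorem above

def detection_probability(k: int, s: int) -> float:
    total = (2 * k) ** 2           # number of shares in the extended matrix
    hidden_fraction = min_unavailable(k) / total   # > 1/4, tends to 1/4 as k grows
    return 1 - (1 - hidden_fraction) ** s

if __name__ == "__main__":
    k = 32
    print(f"min shares withheld: {min_unavailable(k)} of {(2 * k) ** 2}")
    for s in (5, 10, 15, 30):
        print(f"s = {s:2d} samples -> detection probability >= {detection_probability(k, s):.4f}")
```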
Related Works
Nazirkhanova et al. - 2021 - Information Dispersal with Provable Retrievability [2]
- The main application/motivation for this paper is verifiable retrievability of data stored off-chain (while the data availability problem considers data on-chain!)
- No need for fraud proofs; they use erasure codes and a homomorphic VC (KZG)
- Dispersal: The client arranges the input data in an L x k matrix U and
- commit to columns of U, h_i = KZG.Commit(column i)
- apply RS.Encode to rows of U, C is the output matrix, L x n
- send h_1, ... h_k and the column c_i of C to storage provider i
- Local consistency check: storage provider i checks KZG.Commit(c_i) = RS.Encode([h_1, ..., h_k])_i and computes hash = h(h_1 ∥ ... ∥ h_k) (see the sketch after this list)
- The provider sends back to the client a signature on (stored, hash)
- The client collects q valid signatures from different providers (these together form the certificate of retrievability)
- Retrieval: To retrieve a file based on a certificate of retrievability P for a commitment hash, the client first extracts any q unique storage nodes for which P contains a valid signature on (stored, hash). The client then requests the chunks of C from these storage nodes. The storage nodes reply with the commitments and chunks they have stored for hash. The client first waits until some commitments (h_1, ..., h_k) satisfy hash = h(h_1 ∥ ... ∥ h_k). Then, the client discards any chunks that do not satisfy the homomorphic property. Upon receiving valid chunks from k unique storage nodes, the client uses Code.Decode to decode the file.
- Parameters: q ≤ (n−t), 0 < (q−t) and k ≤ (q−t) are necessary.
- So given any t < n/2 and target file size |B| (in field elements), choose q = (n − t), minimize storage overhead with k = n − 2t, and set L = |B|/k.
- In Appendix A, they explain how to use their protocol for data availability proof.
- The first idea is the following:
- block producer: data input has k blocks, encode them in a polynomial p of degree k-1, and let C = KZG.Commit(p), add C to the block header
- light nodes: ask for evaluation of p at random locations
- “once evaluations at k distinct locations are available, the polynomial, and thus the block content, can be reconstructed. Any invalid transaction becomes visible and full nodes can generate corresponding fraud proofs.”
- However, schemes of this flavor require computing an evaluation witness for each sample, which can be computationally heavy.
- So, for a more efficient protocol, the block producer runs the dispersal (commits to the columns of the data and encodes the rows). Light clients download all the column commitments and a column c_i (L elements), and verify correctness by running the local consistency check (no fraud proofs!).
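To make the local consistency check concrete, here is a toy sketch in which a fixed F_p-linear functional stands in for KZG.Commit: it preserves the homomorphic property the check relies on, but it is neither binding nor hiding, and the field, dimensions, and names below are illustrative assumptions rather than the paper's construction.

```python
import random

# Toy parameters (illustrative only): field, matrix dimensions, code length.
p = 65537
L_ROWS, k, n = 3, 4, 8        # U is L x k; C = RS-encoded rows, L x n

def lagrange_coeffs():
    """lam[i][j] = L_j(i): weight of data position j in codeword position i."""
    lam = [[0] * k for _ in range(n)]
    for i in range(n):
        for j in range(k):
            num, den = 1, 1
            for m in range(k):
                if m != j:
                    num = num * (i - m) % p
                    den = den * (j - m) % p
            lam[i][j] = num * pow(den, -1, p) % p
    return lam

LAM = lagrange_coeffs()

def rs_encode_row(row):
    """Systematic RS extension of one length-k row to length n."""
    return [sum(row[j] * LAM[i][j] for j in range(k)) % p for i in range(n)]

# Stand-in for KZG.Commit: a fixed random F_p-linear functional on columns.
G = [random.randrange(p) for _ in range(L_ROWS)]
def commit(column):
    return sum(g * v for g, v in zip(G, column)) % p

# Dispersal (client side): commit to the k columns of U, then RS-encode each row.
U = [[random.randrange(p) for _ in range(k)] for _ in range(L_ROWS)]
h_vec = [commit([U[r][j] for r in range(L_ROWS)]) for j in range(k)]   # h_1..h_k
C = [rs_encode_row(U[r]) for r in range(L_ROWS)]                        # L x n

# Local consistency check (storage provider i): because both the commitment and
# the encoding are F_p-linear, committing to column i of C must equal the i-th
# position of the RS extension of (h_1, ..., h_k).
h_encoded = rs_encode_row(h_vec)
for i in range(n):
    c_i = [C[r][i] for r in range(L_ROWS)]
    assert commit(c_i) == h_encoded[i]
print("local consistency check passed for all n columns")
```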
Protocol:
[todo?] Bi-dimensional KZG
Proposed in the context of “Danksharding” for Ethereum. Reference: https://ethresear.ch/t/2d-data-availability-with-kate-commitments/8081
[todo] Coded Merkle Trees [3]
References
[1] Al-Bassam et al. - 2019 - Fraud and Data Availability Proofs: Maximising Light Client Security and Scaling Blockchains with Dishonest Majorities.
[2] Nazirkhanova et al. - 2021 - Information Dispersal with Provable Retrievability
[3] Yu, M., Sahraei, S., Li, S., Avestimehr, S., Kannan, S., Viswanath, P.: Coded Merkle Tree: Solving data availability attacks in blockchains. FC 2020
[4] Al-Bassam - 2019 - LazyLedger: A Distributed Data Availability Ledger With Client-Side Smart Contracts (no constructions; examples of applications that use a data layer and data availability proofs)