# Motivation

A large majority of users onboard data onto the Filecoin network via an aggregator. Today the work done by aggregators is unverifiable and unprovable. The user relies on the aggregator to perform the work correctly and at the same time, it is impossible to prove to a third party that a given piece of data was included in a deal.

## Requirements

- Provable inclusion of Client’s data within Aggregator’s on-chain data deal
- Storage Provider is able to trivially find user’s data within the larger data segment
- Malicious behaviour of the aggregator or another user, whose data was aggregated, does not cause irrecoverable problems with retrieval.

# Specification

The proposal is split into three parts each of them addressing a part of the requirements.

- Data Segment format for sectors which facilitates inclusion proofs and indexing on the side of Storage Provider
- The Proof of Data Segment Inclusion within a sector or deal
- A format for Batched Merkle Inclusion Proofs
- Optionally a scheme for proving and verification of non-power-of-two aligned or length data segments

## Data Segments without out-of-band data

Today a sector carries a set of deals, with information about where which deal is situated stored out-of-band within Filecoin’s consensus layer. The Data Segment Format aims to allow for this data to be stored in-band, within the deal or sector data itself. The mechanism is independent of whether, in the future, Filecoin enables fully off-chain deals, where the deal maker and storage provider come to an agreement without involving Filecoin’s consensus, or the on-chain deal-making continues; it can be applied either on the sector or a deal level.

The primary goal is to inform the Storage Provider what data is stored where within the deal or sector in a verifiable way, without having to utilize the limited chain space.

### Format

A data segment is a sequence of data which is guaranteed the following properties: it is possible to verify its inclusion within the container and it can be trivially found in the container by the Storage Provider to facilitate data retrieval.

It is important to highlight that on-chain deals, as realised today, fulfil the definition of a data segment but use the chain to facilitate the execution of the second property.

The information that is required from the data segment index are: Commitment to the data segment and its offset within the container, not required but we also decide to store the length of the data segment in the index as well.

The data segment index is a constant-size area at the end of a container containing an array of fixed-sized entries corresponding to the information of data segments. The data segment index is a data structure that is specifically designed to work well with the existing commitment scheme. This structure will have to evolve when a new commitment

Each entry consists of:

- Commitment to the data segment (in the native format of the container, SHA256-truncated currently, without CID-prefix), 254bits
- Offset within the container, 64bits, either as byte offset within the container, or a fixed-point fraction of the size of the container (in Q1.63 format).
- Size of the data segment, 64bits, the same as the offset.
- Optionally: type descriptor, the forever debate of in-band vs out-of-band type description (Windows extensions vs Unix MIME type detection).
- A SHA-2 (or maybe blake2b-128 or keccak) hash of the above data, truncated to 126bits.

The commitment, offset and size facilitate the discovery of the data segment by the Storage Provider and are used by the Client or a third party to verify that the data segment of interest was properly indexed.

The truncated hash of that data is used for the discovery of valid entries within the data segment index. As the creation of the index is controlled by a third party, collision or preimage resistance are insignificant there. A lighter checksum function could be used in place of a cryptographic hash.

Each entry fits into 504 bits, which is twice the node size in the commitment used by PoRep v1.1. Each entry is aligned to the node double-node boundary to facilitate proving that the given data segment descriptor was included in the data segment index.

## Proof of Data Segment Inclusion

The proof consists of two inclusion proofs:

- An inclusion proof of a sub-tree corresponding to the tree of the client’s data, optionally this sub-tree can be miss-aligned with the proving tree.
- An inclusion proof of the double leaf data segment descriptor within the data segment index.

The client possesses the following information: Commitment to their data ($\mathrm{CommD}_C$) and size of their data ($|{\mathrm{D}_C}|$).

The aggregator inserts client’s data into the sector or deal, and also adds the data segment descriptor into data segment index.

Following that, the aggregator produces the data commitment and two inclusion proofs and provides them to the client:

- $\mathrm{CommD}_A$ - a commitment to the data created by the aggregator
- $|\mathrm{D}_A|$ - size of the aggregator’s data, corresponding to deal/sector size.
- $\mathrm{pos}_D$ - position of client’s data within the aggregator’s data
- $\mathrm{idx}_{ds}$ - index of data segment descriptor within the data segment index
- $\pi_{st}$ - sub-tree proof, proving the inclusion of $\mathrm{D}_C$ in $\mathrm{D}_A$ at position $\mathrm{pos}_D$
- $\pi_{ds}$- leaf inclusion proof, proving the inclusion of $(\mathrm{CommD}_A, \mathrm{pos}_D, |\mathrm{D}_C|, \mathrm{checksum})$ within $\mathrm{D}_A$ at $f(\mathrm{idx}_{ds}, |\mathrm{D}_A|)$

The $f(\mathrm{idx}, \mathrm{size})$ is a pre-agreed function mapping position within the data segment index to position within the sector.

Auxiliary information provided by the aggregator to the client are: ActorID of the SP which accepted the deal and the sector number.

To complete the verification of the proof, the client has to verify the two inclusion proofs as well as check that a given sector was on-boarded and $\mathrm{CommD}_A$ was size $|\mathrm{D}_A|$:

- $\mathrm{VerifyInclusion}(\mathrm{CommD}_C, |\mathrm{D}_C|, \mathrm{CommD}_A, |\mathrm{D}_A|, \pi_{st}, \mathrm{pos}_D)$
- $\mathrm{VerifyInclusion}(\mathrm{Comm}(\mathrm{CommD}_C, \mathrm{pos}_D, |\mathrm{D}_C|, \mathrm{checksum}), 2, \mathrm{CommD}_A, |\mathrm{D}_A|, \pi_{ds}, f(\mathrm{idx}_{ds}, |\mathrm{D}_A|))$
- Verify that $|\mathrm{D}_A|$ corresponds to on-chain data size
- Verify that $\textrm{CommD}_A$ was on-boarded in a sector.

## Batched Merkle Tree inclusion proofs

Merkle Tree inclusion proofs are the primary way we use for proving that given data is where it claimed to be. This can change in the future but Merkle Tree proofs will stay around to be used for other purposes and cross-chain capabilities.

The size of Merkle inclusion proof for 2-ary tree generally is: `sizeOfNode*(depthOfTree+1)`

, for the purpose of data inclusion we work with nodes of 32 bytes and trees 30 layers deep, resulting in the proof size of 992 bytes.

With no batching scheme, the proof size scales linearly with the number of provided proofs, but there exists common data when proving into the same tree root. This can provide significant savings when proofs are created and provided for nearby portions of the tree. The proof size saving can reach 90% which is expected for the data segment index area.

# Design Rationale

The design was guided by the three requirements.

An inclusion proof of the data itself is the simplest way to achieve a verifiable deal aggregation. While this guarantees data inclusion within a sector, it does not provide any guarantees around the ability of the Storage Provider of finding that data.

A possibly simpler than the proposed approach way to achieve the second requirement is a specially designed CAR structure and padding, rendering a user-provided CAR readable as part of a larger, deal- or sector-wide, CAR. This has a drawback of CAR structure being fragile and the discoverability of user’s data within that larger CAR is a function of all the previous data within it. This data could be adversarial in nature, making data from user CAR unretrievable.

This is what steers the design towards the proposed approach, data segment index provides the discoverability of user data within the container, and is specially designed to lend itself to producing proofs. There are still open questions in that design.

An entry in the data segment index could include a type descriptor of the data contained within the data segment. This would necessitate creation of registry of data segment types to allow Storage Providers to interpret the types within the index. An alternative to this approach is not to include a type descriptor and rely on type detection for identification. This is approach widely used within computing space, the only operating system relying on external type descriptor is Windows through extensions, and while extensions are used to choose which application should open a given file type, the applications frequently conduct their own type detection.

Another still uncertain choice is the checksum function for the data segment index entries. While the proposal currently mentions a cryptographic hash function, none of its cryptographic properties are used, a possible faster-to-compute checksum function or non-cryptographic hash function could be used as long they provide universality and uniformity. If such a checksum function were to be chosen, the computation cost within Filecoin’s WASM and FEVM runtimes should be evaluated and compared with hash functions available as syscalls and precompiles.