Verifiable Data Queries for Web3

TL;DR: we propose and plan research to move from “vector commitments” (which one can efficiently query by position) to “verifiable DBs” (which one can query as if the committed data were databases/spreadsheets/chains of JSON objects). If we had them, we could make data stored on Filecoin queryable, let layer 2 state focus more on applications, and obtain more powerful applications on IPLD.

A note on naming: while “verifiable DB” already exists in the literature (here), those constructions involve a secret key. By that phrase we instead denote publicly verifiable objects. We may find a better name in the near future.

Verifiable Database

Vector commitments (aka VCs, see e.g. here and here) are like “hashes” but with more: they allow us to produce a digest of a file/string and then check that specific positions hold a certain value. While I can convince you that a hashed file has the character “\n” at position 10 only by sending you the whole file, with a vector commitment I could convince you with a “proof” that is much shorter than that (“succinctly”).
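
To make the interface concrete, here is a minimal sketch (in Python, illustrative only) of a Merkle-tree-based vector commitment; real constructions such as KZG- or RSA-based VCs offer shorter proofs and extra features, but the interface is the same: commit once, then open any position with a proof of logarithmic size.

import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def commit(vector):
    # Returns all tree levels; the root (levels[-1][0]) is the commitment.
    # Assumes len(vector) is a power of two for simplicity.
    level = [H(v) for v in vector]
    levels = [level]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def open_position(levels, i):
    # Proof for position i: the sibling hash at each level (log2(n) hashes).
    proof = []
    for level in levels[:-1]:
        proof.append(level[i ^ 1])
        i //= 2
    return proof

def verify(root, i, value, proof):
    h = H(value)
    for sibling in proof:
        h = H(h + sibling) if i % 2 == 0 else H(sibling + h)
        i //= 2
    return h == root

# Commit to a 4-entry vector, then prove position 2 holds b"c".
levels = commit([b"a", b"b", b"c", b"d"])
root = levels[-1][0]
assert verify(root, 2, b"c", open_position(levels, 2))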

One can see a VC as a data structure where we query positions (the “10” in the example above) and receive a verified value corresponding to that position. Here we ask:

What if we could support more expressive queries than just positions?

As an example, if we committed to a table T, could we receive a verifiable query response to, say,

SELECT Age, Name, Hobby FROM T WHERE Age >= 18 AND Name LIKE 'M%';
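
To fix ideas, a hypothetical interface for such a scheme might look as follows (the names and types here are ours, not from any existing library); the point is that verification needs only the short commitment, never the table itself.

from typing import NamedTuple

class QueryResponse(NamedTuple):
    rows: list          # the query result
    proof: bytes        # succinct certificate, much smaller than the table

# Run by the party holding the table (e.g. a storage provider).
def prove(table, query: str) -> QueryResponse: ...

# Run by anyone holding only the commitment; should take milliseconds.
def verify_query(commitment: bytes, query: str, resp: QueryResponse) -> bool: ...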

Other examples include:

  • spreadsheet-like queries (where the committed data represent a spreadsheet)

  • GraphQL queries

What is missing from existing solutions?

Clearly vector commitments are not expressive enough. This problem can also be seen as “SNARKs on committed data”. However, existing solutions are still not practical: they may require certificates that are too large or too many resources from the prover.

Applications

Why are we working on this?

If Web3 is to replace Web2, then it needs to be able to handle large datasets and queries on them. For example: how would we filter all the movies that were liked by most users on YouTube? In Web2 that’s easy, but in Web3 it often requires a trusted third party to run a database themselves based on on-chain data.

Making data stored on Filecoin queryable

Data stored on Filecoin is currently just a stream of bytes; it would be great if Storage Providers were able to answer queries over it and if such answers were certified.

Blockchain state focused on applications instead of transactions (in particular Layer 2)

  • Problem 1: The Ethereum chain state is not SNARK/proof friendly, nor is the state of most layer 2s.
  • Problem 2: Most of the state of layer 1s and layer 2s (rollups such as Arbitrum) focuses on transactions or contract interactions. If we could build a state that is database-first, then both layer 1s and layer 2s could focus more on applications and less on transactions. (E.g., imagine a decentralized “hub” for code repositories where we want to be able to query codebases with certain features. More examples here.)
  • Problem 3: There are some databases being built on top of Ethereum state, such as Dune.xyz, TableLand, TheGraph and Sismo.eth. However, most of these databases are either centralized (such as Dune and Sismo) or rely on majority assumptions for the outcome of the queries being correct (such as TheGraph).

A verifiable database would solve Problems 1, 2 and 3, and could be the data structure for an application-specific layer 2.

More powerful IPLD-based applications

IPLD is a network of linked hashed objects.

  • Problem 1: These objects are currently linked via non-SNARK-friendly hash functions, which makes it hard to prove statements about them
  • Problem 2: The structure of IPLD data makes it difficult to support queries other than inclusion proofs
  • Problem 3: IPLD only supports “classic hash”-based linking (where by “classic hash” we mean combinatorial machines such as SHA, which are not vector-friendly)
  • Problem 4: IPLD does not have a system for querying an IPLD graph in a provable way (how do we prove a chain of queries?)

An easy fix to Problem 3 is to move from hashes to vector commitments. To overcome Problems 1 and 2 we may benefit from a verifiable database that supports richer queries. Problem 4 could be studied on its own as an extension.
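
As a toy illustration of the fix to Problem 3 (reusing the commit/open_position/verify functions from the Merkle sketch above, so this still uses a “classic hash” underneath and only illustrates the interface): if a CID were the root of a vector commitment over an object’s encoded fields, then a single field, such as a link to a child object, could be opened without transferring or re-hashing the whole object. The field layout below is our own invented example.

fields = [b"name=movies-db", b"link=<CID of child object>", b"rows=...", b"<pad>"]
levels = commit(fields)           # CID := vector-commitment root over the fields
cid = levels[-1][0]
proof = open_position(levels, 1)  # open only the child-link field
assert verify(cid, 1, b"link=<CID of child object>", proof)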


Projects

Project 1: MVP for simple queries on a spreadsheet

A Minimum Viable Product (MVP) that can show the power of better data structures; it will get our hands dirty on the core problems in designing more generic data structures.

The idea would be to develop a “verifiable spreadsheet” with reasonable efficiency. To approach this problem we could tweak existing techniques: seek smart encodings of these queries as circuits, or try to augment vector commitments with some clever encodings.

  • Deliverable: an implementation of a scheme supporting a spreadsheet-like context (or at least a filter/map/reduce setting), relevant benchmarks, and a related technical report.
  • Timeline: end of Q3/Q4 2022
  • DRI: Matteo Campanelli
  • Potential People: Rosario Gennaro, Dario Fiore, a student working on the implementation

Plan for Supported Queries

As mentioned above, a minimal requirement is a pipeline of the type

output = Reduce(f, Map(g, Filter(p, table)))

where “table” is the committed table. Some additional details:

Predicate   | Meaning
Filter(p)   | Filter elements by a simple predicate p (e.g. a regex or a divisibility check)
Map(g)      | Map a vector element-wise through a simple function g (e.g. a pattern substitution or a minimal polynomial)

Additionally, it would be useful to support “zipping” of two or more vectors before mapping them.

We note that, above, “output” may itself be a new cell/column; therefore, we could use this approach to support updates.
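
For reference, the plain (unverified) semantics of this pipeline are just the usual functional primitives; the whole point of the project is to produce, on top of this computation, a succinct proof that it was carried out correctly over the committed table. A toy instance (with our own example data):

from functools import reduce

table = [{"Name": "Mia", "Age": 20}, {"Name": "Bo", "Age": 15}, {"Name": "Max", "Age": 30}]

p = lambda row: row["Age"] >= 18     # Filter: keep adults only
g = lambda row: len(row["Name"])     # Map: name length, element-wise
f = lambda acc, x: acc + x           # Reduce: sum

output = reduce(f, map(g, filter(p, table)), 0)
assert output == 6                   # rows "Mia" and "Max" pass the filter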

We leave it as part of the project work to investigate more meaningful spreadsheet queries that we can support.

Project 2: Verifiable SQL-friendly database

Goal: run generic SQL queries (or a subset thereof) in a verifiable fashion on a “verifiable database” for which there is a known commitment.

Our intuition is that we can achieve this by structuring the data in a specific way and by combining a specific authenticated data structure with a way to prove statements (either via SNARKs or functional commitments).

  • Deliverable: an implementation of a scheme supporting a minimal set of SQL queries (see below), relevant benchmarks, and a related technical report.
  • DRI: Matteo Campanelli
  • Timeline: End of Q1 2023
  • Potential People: Rosario Gennaro, Dario Fiore, a student working on the implementation

Minimal set of supported SQL queries

The following example queries (or equivalent ones) should be supported. Here we propose to support at least SQL. We leave it as part of the project work to investigate to what extent we can support GraphQL (SQL may be a “lower-hanging fruit”, since there is already literature on it, albeit not yet practical, e.g. vSQL).

Joins of a few tables

SELECT Shippers.ShipperName, COUNT(Orders.OrderID) AS NumberOfOrders
FROM Orders
LEFT JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
GROUP BY ShipperName;

Simple “Where” conditions and grouping:

SELECT n_name, SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer, orders, lineitem, supplier, nation, region
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey AND c_nationkey = n_nationkey
  AND n_regionkey = r_regionkey AND r_name = 'MIDDLE EAST'
  AND o_orderdate >= DATE '1997-01-01'
  AND o_orderdate < DATE '1997-01-01' + INTERVAL '1' YEAR
GROUP BY n_name
ORDER BY revenue DESC;

Simple updates

UPDATE Customers
SET ContactName = 'Alfred Schmidt', City= 'Frankfurt'
WHERE CustomerID = 1;

Project 3: Queries on Linked Datasets

This project description requires some familiarity with IPLD and its notion of selectors.

This setting considers the existence of several databases (or tables, or graphs) that are linked with each other. Our starting point is the vision of IPLD: the objects we want to explore are structured as a DAG. For example, there is a large IPLD object linking to other IPLD objects and we want to query some of the linked objects.

Systems may also need to select (and possibly transform) data in an IPLD DAG. An advantage of IPLD is that selection can be done in a single stream; this is also true of recursive selections.

A problem with the current IPLD approach is that selection is not efficiently verifiable: to check that a returned object matches a CID, I may need to compute a hash over it in its entirety. We want to avoid this.

Goal: understand how to extend IPLD to support queries (selection within an object and across objects) that can be verified in the same fashion as described in the rest of this document, i.e. more efficiently than reading the whole object when that is not necessary.

Querying an IPLD DAG has additional challenges compared to the SQL/spreadsheet setting; these also stem from the recursive nature of the queries. We see Projects 1 and 2 as “warm-up” projects for the IPLD setting, with broad impact by themselves.

  • Deliverables:
    • a technical report describing challenges, cryptographic definitions and constructions for verifiable queries on linked data
    • a proof-of-concept implementation
  • DRI: Matteo Campanelli
  • Timeline: End of Q2 2023
  • Potential People: Rosario Gennaro, Dario Fiore, a student working on the implementation

More on the Technical Requirements

How do we define “practical” and “efficient”?

  • proofs should be very short (a few KB for most relations)
  • the verifier should be fast (a few ms)
  • the prover should run in reasonable time (ideally on the order of a handful of minutes for representative datasets/queries)

Zero-knowledge is not a requirement at the moment.

Leads and backup options

How do we plan to go about achieving this?

There are four avenues of attack (in some of the ideas below we refer to the setting of Project 1):

  • “dumbing down” SNARKs
    • an easy entry point is tweaking lightweight SNARKs, e.g. GKR-based ones, and seeing how they perform.
    • This is medium risk and requires exploration
  • “scaling up” vector commitments + SNARKs
    • naive idea: can we commit to additional “hinting” information in a vector/polynomial commitment that can make proving easier?
    • example:
      • assume each entry in the file is a long piece of text
      • assume we are interested in supporting prefix/suffix queries
      • we can also commit to the prefixes and suffixes of each entry to avoid recomputing them at proving time (see the sketch after this list)
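
A hypothetical sketch of this hinting idea (the layout below is our own; a real scheme would additionally need a consistency argument, elided here, that the hint positions really hold prefixes of the corresponding entry): commit to each entry together with its short prefixes as extra vector positions, so that proving “entry i starts with s” reduces to opening a single position.

def expand_with_prefix_hints(entries, max_prefix):
    # Lay out each entry followed by its prefixes of length 1..max_prefix.
    slots = []
    index = {}   # (entry_idx, prefix_len or None) -> position in the vector
    for i, e in enumerate(entries):
        index[(i, None)] = len(slots)
        slots.append(e)                         # the entry itself
        for k in range(1, min(max_prefix, len(e)) + 1):
            index[(i, k)] = len(slots)
            slots.append(e[:k])                 # hint: the k-byte prefix
    return slots, index

entries = [b"hello world", b"merkle tree"]
slots, index = expand_with_prefix_hints(entries, max_prefix=4)
# Proving that entries[1] starts with b"merk" now amounts to opening the
# single position index[(1, 4)] of the committed vector and checking that
# the opened value equals b"merk".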

Backup plans: if we cannot successfully carry out any of the ideas above, then we can:

  • look at how straightforward circuit descriptions of simple queries would perform, using commit-and-prove techniques (e.g. LegoSNARK) to cherry-pick the best-performing existing SNARKs for each step of the pipeline, i.e. filtering/mapping/reducing
  • recursive proof systems are on the rise and may offer a solution to this problem