- Verifiable Database
- What is missing from existing solutions?
- Applications
- Make data stored on Filecoin query-able
- Blockchain state focused on applications instead of transactions (in particular Layer 2)
- More powerful IPLD-based applications
- Projects
- Project 1: MVP for simple queries on a spreadsheet
- Plan for Supported Queries
- Project 2: Verifiable SQL-friendly database
- Minimal set of supported SQL queries
- Joins of few tables
- Simple “Where” conditions and grouping:
- Simple updates
- Project 3: Queries on Linked Datasets
- More on the Technical Requirements
- Leads and backup options
TL;DR: we propose and plan research to move from “vector commitments” (which one can efficiently query on positions) to “verifiable DBs” (which one can query as if the committed data were DBs/spreadsheets/chains-of-JSON-objects). If we had them, we could make data stored on Filecoin query-able, we could have layer 2 state focus more on applications, and we could obtain more powerful IPLD-based applications.
A note on naming: while “verifiable DBs” exist in the literature (here), those involve a secret key; by that phrase we instead denote publicly verifiable objects. We may find a better name in the short-term future.
Verifiable Database
Vector commitments (aka VC, see e.g. here and here) are like “hashes” but with more: they allow us to make a digest of a file/string and then to check that specific positions have a certain value. While I can convince you that a hashed file has character “\n” at position 10 only by sending you the whole file, with a vector commitment I could convince you with a “proof” that is much shorter than that (“succinctly”).
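For intuition, here is a toy Merkle-tree-based vector commitment in Python. This is only a sketch (real VCs, e.g. pairing-based ones, make different trade-offs, and all names here are illustrative); the point is that the proof for one position is logarithmic in the vector length instead of the whole vector.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def commit(vector: list) -> bytes:
    """Merkle root committing to the whole vector (the 'digest')."""
    layer = [H(v) for v in vector]
    while len(layer) > 1:
        if len(layer) % 2:                 # pad odd layers
            layer.append(layer[-1])
        layer = [H(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

def open_position(vector: list, i: int) -> list:
    """Authentication path for position i: the succinct 'proof'."""
    layer, proof = [H(v) for v in vector], []
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(layer[-1])
        proof.append(layer[i ^ 1])         # sibling of the current node
        layer = [H(layer[j] + layer[j + 1]) for j in range(0, len(layer), 2)]
        i //= 2
    return proof

def verify(root: bytes, i: int, value: bytes, proof: list) -> bool:
    node = H(value)
    for sibling in proof:
        node = H(node + sibling) if i % 2 == 0 else H(sibling + node)
        i //= 2
    return node == root

# The example from the text: position 10 of the committed string is "\n".
vec = [c.encode() for c in "abcdefghij\nklmno"]
root = commit(vec)
assert verify(root, 10, b"\n", open_position(vec, 10))
```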
One can see VC as a data structure where we query positions (the “10” in the example above) and receive a verified value corresponding to that position. Here we ask:
What if we could support more expressive queries than just positions?
As an example, if we committed to a table T, could we receive a verifiable query response to, say,
```sql
SELECT Age, Name, Hobby FROM T WHERE Age >= 18 AND Name LIKE 'M%';
```
Other examples include:
- spreadsheet-like queries (where the committed data represent a spreadsheet)
- GraphQL queries
What is missing from existing solutions?
Clearly vector commitments are not expressive enough. This problem can also be seen as “SNARKs on committed data”. However, existing solutions are still not practical: they may require certificates that are too large, or too many resources from the prover.
Applications
Why are we working on this?
If Web3 is to replace Web2 then it needs to be able to handle large datasets and queries on those, for example filtering all the movies that were liked by most users on YouTube. In Web2 that’s easy, but in Web3 this often requires a trusted third party to run a database themselves based on on-chain data.
Make data stored on Filecoin query-able
Data stored on Filecoin is currently just a stream of bytes; it would be great if Storage Providers were able to answer queries on it, and if such answers were certified. See Filecoin On-chain Storage Service for more.
Blockchain state focused on applications instead of transactions (in particular Layer 2)
- Problem 1: The Ethereum chain state is not SNARK/proof-friendly, nor is the state of most layer 2s.
- Problem 2: Most of the state of layer 1s and layer 2s (rollups such as Arbitrum) focuses on transactions or contract interactions. If we could build a state that is database-first, then we could have both layer 1s and layer 2s focus more on applications and less on transactions (e.g. imagine a decentralized “hub” for code repos where we want to be able to query codebases with certain features; more examples here).
- Problem 3: There are some databases being built on top of Ethereum state, such as Dune.xyz, TableLand, TheGraph and Sismo.eth. However, most of these databases are either centralized (such as Dune and Sismo) or rely on majority assumptions for the outcome of the queries to be correct (such as TheGraph).
A verifiable database would solve Problems 1, 2 and 3 and could be the data structure for an application-specific Layer 2.
More powerful IPLD-based applications
IPLD is a network of linked hashed objects.
- Problem 1: These objects are currently linked via non-SNARK-friendly hash functions, which makes it hard to prove statements about them.
- Problem 2: The structure of IPLD data makes it difficult to support queries other than inclusion proofs.
- Problem 3: IPLD only supports “classic hash”-based linking (where by “classic hash” we mean combinatorial constructions such as SHA, which are not vector-friendly).
- Problem 4: IPLD has no system for querying an IPLD graph in a provable way (how do we prove a chain of queries?).
An easy fix to Problem 3 is to move from hashes to vector commitments. To overcome Problems 1 and 2 we may benefit from a verifiable database that supports richer queries. Problem 4 could be studied on its own as an extension.
Projects
Project 1: MVP for simple queries on a spreadsheet
A Minimum Viable Product (MVP) that can show the power of better data structures; it will get our hands dirty on the core problems in designing more generic data structures.
The idea would be to develop a “verifiable spreadsheet” with reasonable efficiency. To approach this problem we could tweak existing techniques: e.g., smart encodings of these queries as circuits (see the toy sketch below), or augmenting vector commitments with some clever encodings.
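As a toy illustration of the “encoding queries as circuits” direction (hypothetical names, plain Python): a filter like `Age >= 18` can be handled by having the prover supply a selection-bit vector as a witness, with the circuit enforcing booleanity and consistency constraints on it. Below we merely evaluate those checks directly; in a real SNARK the comparison would need a range/comparison gadget.

```python
def check_filter_witness(ages: list, bits: list) -> bool:
    """Checks the constraints a circuit would enforce on witness `bits`."""
    for age, b in zip(ages, bits):
        if b * (b - 1) != 0:            # booleanity: b must be 0 or 1
            return False
        if b != int(age >= 18):         # b = 1 exactly on the selected rows
            return False
    return True

ages = [17, 25, 18, 3]
bits = [0, 1, 1, 0]
assert check_filter_witness(ages, bits)
selected = [a for a, b in zip(ages, bits) if b]   # the query output: [25, 18]
```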
- Deliverable: an implementation of a scheme supporting a spreadsheet-like context (or at least a filter/map/reduce setting), relevant benchmarks and a related technical report.
- Timeline: end of Q3/Q4 2022
- DRI: Matteo Campanelli
- Potential People: Rosario Gennaro, Dario Fiore, a student working on the implementation
Plan for Supported Queries
As mentioned above, a minimal requirement is a pipeline of the type
output = Reduce(f, Map(g, Filter(p, table)))
where “table” is the committed table. Some additional details:
| Operation | Meaning |
| --- | --- |
| Filter(p) | Filter elements by a simple predicate p (e.g. a regex or a divisibility check) |
| Map(g) | Map a vector element-wise through a simple function g (e.g. a pattern substitution or a simple polynomial) |
| Reduce(f) | Reduce a vector of elements to a single element by repeatedly applying a function f of arity 2 (e.g., f(x,y) = x+y sums all the elements) |
Additionally, it would be useful to support “zipping” of two or more vectors before mapping them.
We note that, above, “output” may itself be a new cell/column; therefore, we could use this approach to support updates.
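As an unverified reference semantics for the pipeline above (illustrative names; this is the computation the proof system would have to certify against the commitment to “table”):

```python
from functools import reduce

def run_pipeline(table, p, g, f):
    filtered = [row for row in table if p(row)]   # Filter(p, table)
    mapped = [g(row) for row in filtered]         # Map(g, ...)
    return reduce(f, mapped)                      # Reduce(f, ...)

# Example: sum of the squares of the even entries.
out = run_pipeline([1, 2, 3, 4],
                   p=lambda x: x % 2 == 0,
                   g=lambda x: x * x,
                   f=lambda x, y: x + y)
assert out == 20
```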
We leave it as part of the project work to investigate more meaningful queries for spreadsheets that we can support.
Project 2: Verifiable SQL-friendly database
Goal: run generic SQL queries (or a subset thereof) in a verifiable fashion on a “verifiable database” for which there is a known commitment.
Our intuition is that we can achieve this by structuring the data in a specific way and using specific authenticated data structures together with a way to prove statements (either via SNARKs or functional commitments).
- Deliverable: an implementation of a scheme supporting a minimal set of SQL queries (see below), relevant benchmarks, and a related technical report.
- DRI: Matteo Campanelli
- Timeline: End of Q1 2023
- Potential People: Rosario Gennaro, Dario Fiore, a student working on the implementation
Minimal set of supported SQL queries
The following example queries (or equivalent ones) should be supported. Here we propose to at least support SQL. We leave as part of the project work to investigate to what extent we can support GraphQL (SQL may be a “lower-hanging fruit” since there is already literature on it, albeit not yet practical, e.g. vSQL).
Joins of few tables
```sql
SELECT Shippers.ShipperName, COUNT(Orders.OrderID) AS NumberOfOrders
FROM Orders LEFT JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
GROUP BY ShipperName;
```
Simple “Where” conditions and grouping:
```sql
SELECT n_name, SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer, orders, lineitem, supplier, nation, region
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey AND c_nationkey = n_nationkey
  AND n_regionkey = r_regionkey AND r_name = 'MIDDLE EAST'
  AND o_orderdate >= date '1997-01-01'
  AND o_orderdate < date '1997-01-01' + interval '1' year
GROUP BY n_name
ORDER BY revenue DESC;
```
Simple updates
```sql
UPDATE Customers
SET ContactName = 'Alfred Schmidt', City = 'Frankfurt'
WHERE CustomerID = 1;
```
Project 3: Queries on Linked Datasets
This project description requires some familiarity with: IPLD and its notion of selectors.
This setting considers the existence of several databases (or tables, or graphs) that are linked with each other. Our starting point is the vision of IPLD: the objects we want to explore are structured as a DAG. For example, there is a large IPLD object linking to other IPLD objects and we want to query some of the linked objects.
Systems may also need to select (and possibly transform) data in an IPLD DAG. An advantage of IPLD is that selection can be done in one single stream; this is also true of recursive selections.
A problem with the current IPLD approach is that selection is not efficiently verifiable: to check that a returned object matches a CID, I may need to hash it in its entirety. We want to avoid this.
Goal: understand how to extend IPLD to support queries (selection in an object and across objects) that can be verified in the same fashion as described in the rest of this document, i.e. more efficiently than reading the whole object if that is not necessary.
Querying an IPLD DAG has additional challenges compared to the SQL/spreadsheet setting; these also stem from the recursive nature of the queries. We see Projects 1 and 2 as “warm-up” projects for the IPLD setting, with broad impact by themselves.
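To make the target concrete, here is a toy sketch of verifiable link traversal (this is not IPLD’s actual format; the layout and names are illustrative). Each object is committed via the hashes of its fields, so one field can be checked without shipping or re-hashing the other, possibly huge, fields; a link is just a child commitment stored as a field, and a cross-object query verifies a short chain of openings.

```python
import hashlib

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def commit(fields: list) -> bytes:
    """Commit to an object as the hash of its field hashes."""
    return H(b"".join(H(f) for f in fields))

def verify_field(root: bytes, i: int, value: bytes, field_hashes: list) -> bool:
    """Check field i of a committed object given only the field hashes."""
    return field_hashes[i] == H(value) and H(b"".join(field_hashes)) == root

# A child object with a large payload; the parent links to it in field 1.
child_fields = [b"name:leaf", b"payload:" + b"x" * 10_000]
child_root = commit(child_fields)
parent_fields = [b"name:root", child_root]
parent_root = commit(parent_fields)

# Query "parent.link.name": two openings, never touching the large payload.
assert verify_field(parent_root, 1, child_root, [H(f) for f in parent_fields])
assert verify_field(child_root, 0, b"name:leaf", [H(f) for f in child_fields])
```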
- Deliverables:
- a technical report describing challenges, cryptographic definitions and constructions for verifiable queries on linked data
- a proof-of-concept implementation
- DRI: Matteo Campanelli
- Timeline: End of Q2 2023
- Potential People: Rosario Gennaro, Dario Fiore, a student working on the implementation
More on the Technical Requirements
How do we define “practical” and “efficient”?
- proofs should be very short (a few KBs for most relations)
- the verifier should be fast (a few ms)
- the prover should run in a reasonable time (ideally on the order of a handful of minutes for representative datasets/queries)
Zero-knowledge is not a requirement at the moment.
Leads and backup options
How do we plan to go about achieving this?
There are four avenues of attack (for some of the ideas below we refer to the setting of Project 1):
- “dumbing down” SNARKs
  - an easy entry point is tweaking lightweight SNARKs, e.g. GKR-based ones, and seeing how they perform
  - this is medium risk and requires exploration
- “scaling up” vector commitments + SNARKs
  - naive idea: can we commit to additional “hinting” information in a vector/polynomial commitment that can make proving easier? (see the toy sketch after this list)
  - example:
    - assume each entry in the file is a long piece of text
    - assume we are interested in supporting prefix/suffix queries
    - we can then also commit to prefixes and suffixes of each entry to avoid recomputing them at proving time
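A toy sketch of the hinting idea (the layout is hypothetical, and a plain dictionary stands in for the actual vector/polynomial commitment): prefixes are committed alongside each entry, so a query “does entry i start with p?” reduces to opening the hint at position (i, len(p)) and checking equality with p, instead of re-processing the long entry inside the proof.

```python
def commit_with_hints(entries: list, max_prefix: int):
    """Committed data = the entries plus, per entry, its short prefixes."""
    hints = {(i, k): e[:k]
             for i, e in enumerate(entries)
             for k in range(1, min(max_prefix, len(e)) + 1)}
    return entries, hints        # stand-in for a real commitment

def open_prefix_hint(committed, i: int, p: str):
    """A real proof would be a positional opening; the verifier then
    simply checks that the opened value equals p."""
    _, hints = committed
    return hints.get((i, len(p)))

committed = commit_with_hints(["matteo", "rosario", "dario"], max_prefix=4)
assert open_prefix_hint(committed, 0, "mat") == "mat"   # entry 0 starts with "mat"
```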
Backup plans: if we cannot successfully carry out any of the ideas above, then we can:
- look at how straightforward circuit descriptions of simple queries would perform, using commit-and-prove techniques (e.g. LegoSNARK) to cherry-pick the best-performing existing SNARKs for each step of the pipeline, i.e. filtering/mapping/reducing
- recursive proof systems are on the rise and may offer a solution to this problem