Testudo: Efficient SNARKs with Smaller Setups

Created by

Deleted User

Date

March 27, 2023

In this blog post, we present Testudo a new near linear-time $^*$ prover SNARK with the following advantages:

Small & Universal Setup: It uses a trusted setup of square-root size — i.e., for an R1CS of size $N$ , the trusted setup is of size $O(\sqrt{N})$ . For large circuits, this brings the trusted setup material to MBs rather than GBs.
Very fast prover: Estimated to run more than ~5x faster than fastest Groth16 implementation (i.e., Bellperson) for data-parallel computations.
Small Proofs & Fast Verifier: Constant size proofs and verification time.
Uses R1CS, the most widely used approach to writing circuits in deployed libraries. This gives us the following advantages:

Switching to Testudo allows us to reuse existing code.
It does not require rewriting circuits in a different arithmetization.

(*) Our prover runs $\sqrt{N}$ multi-exps of size $\sqrt{N}$ , which is roughly $O(N * \lambda/\log(N))$ group operations with $\lambda >> \log(N)$ for security.

Internals: At its core, our SNARK relies on three carefully combined building blocks:

a modified version of Spartan (sumcheck-based scheme)
a pairing-based multivariate commitment scheme (modified version of PST)
a final Groth16 layer.

Testudo has an ongoing implementation using arkworks with a blst integration with GPU support:

Call for participation: We welcome anyone interested to contribute to the Testudo implementation. The project tackles many challenging points and is at the state of the art when it comes to proving R1CS circuits. See the last section for more information!

Why the name Testudo? Testudo was a type of battle formation that ancient Rome adopted, where its soldiers operated “under the hood” of their shields. Testudo, the proof scheme, is similar: a Spartan prover woking under the hood of Groth16.

Context

Our initial motivation for developing Testudo was to improve the SNARKs used in Filecoin. Filecoin requires storage providers to prove to the whole network that they are holding the storage they had initially committed to. The circuit involved has $\approx 2^{30}$ constraints (one of the largest circuits used in practice today) and is verified by Groth16.

The computation is large enough to push current hardware to its limits: the big circuit is actually “broken down” into 10 subcircuits each of size $\approx 2^{27}$ , due to limitations on the maximum size of the trusted setup. Also, one issue with Groth16 is the function-specific trusted setup which complicates deployment of new versions of the Filecoin protocol (e.g., a new proof-of-space) since this would require a new circuit. While Filecoin is an interesting specific study-case, issues of this kind may also apply more generally to other deployed systems with similar requirements.

Therefore, our goals were as follows:

Universal trusted setup of sub-linear size
Faster (or at least comparable) prover time than Groth16
Short proofs and fast verification time
R1CS-based, to be “off-the-shelf” compatible with current FIL Proofs circuits
A SNARK that could leverage “data parallel computation” since FIL Proofs are largely data parallel

A bit more in depth

Recall that an R1CS system is defined by three $N \times N$ matrices, $A,B,C$ and we say that it is satisfiable if there exists a witness $w$ such that $\langle A \circ w \rangle \cdot \langle B \circ w \rangle - \langle C \circ w \rangle = \vec{0}$

Spartan Recap

At a high level, the Spartan prover consists of several steps as in the following:

Above

A,B,C

are the constraint matrices of the R1CS instance

High Level Testudo Description

Polynomial Commitments:

In a preprocessing phase, the prover encodes $A,B,C$ as sparse polynomials $\tilde{A}, \tilde{B}, \tilde{C}$ and commits to them via polynomial commitments (called computation commitments in Spartan). We note that for uniform circuits (i.e. very data-parallel, with many sub-circuits repeating in regular patterns), this step is not necessary or much reduced in complexity, since the Verifier can efficiently compute $\tilde{A}, \tilde{B}, \tilde{C}$ on their own (or the polynomials $\tilde{A},\tilde{B},\tilde{C}$ are much smaller than $A,B,C$ )
In the online phase the prover computes a multilinear extension $\tilde{w}$ of the witness and commits to it using any multivariate polynomial commitment scheme. While Spartan uses a discrete-log transparent scheme (Hyrax) we use our modification of PST described below.
Note that the polynomials are of size $O(N)$ here, corresponding to the number of R1CS constraints.

Sumchecks:

The Spartan prover executes two sumcheck protocols sequentially using $\tilde{A},\tilde{B},\tilde{C},\tilde{w}$ to prove the satisfiability equation of the R1CS constraints $\langle A \circ w \rangle \cdot \langle B \circ w \rangle - \langle C \circ w \rangle = \vec{0}$
The sumcheck verifier naively runs in time $O(\log N)$ . In Testudo we encode this verifier into an R1CS circuit and the prover provides a Groth16 proof of knowledge of the accepting Spartan sumcheck proof.

Opening of polynomial commitments: At the end of the sumchecks, the prover needs to show to the verifier the value of the polynomials $\tilde{A}, \tilde{B},\tilde{C}, \tilde{w}$ on a random point $\vec{x},\vec{y}$ (of size $O(\log N)$ )

🐌 To open the modified PST commitment for the witness, the prover operates on an $O(N)$ -sized polynomial, which requires several polynomial divisions to get the quotient polynomials necessary for the proof, plus $O(N)$ -sized exponentiations.
🐌 The prover also opens the Computation Commitments, which is where the Spartan prover spends a large chunk of its time. We recall however that on uniform circuits this step is not mandatory as verifier can open the matrix himself.
These openings can be either left in the “clear” for a $O(\log N)$ size proof (the PST opening of the witness polynomial) or also fed to a Groth16 prover for a constant size proof. We discuss the benefits and challenges of either approach below.

Testudo Details

In Testudo, we apply several optimizations to the blueprint above, both at prover and verifier level.

🚀 Moving to universal trusted setup: We note that above we require a trusted setup for

the modified PST polynomial commitments,
Groth16 on the sumcheck and polynomial commitment verification — we note that these circuits are of size $O(\log N)$ which is their contribution to the size of the trusted setup.

Both are independent of the specific R1CS being proven, yielding a universal trusted setup.

🚀 Reducing the trusted setup size: PST, like KZG, requires a trusted setup of size $O(N)$ . By observing a tensor product structure in the opening proof of PST, this can be reduced to $O(\sqrt{N})$ . In practice, this means going from GBs to MBs when loading/storing a setup. This is obtained by designing a version of PST that works together with products for inner product arguments (MIPP)

We briefly exemplify this on a polynomial defined by 4 evaluation points over the hypercube. Note how the PST.Commit steps only operates on a polynomial of half of the degree!

🚀 The verifier needs to verify (a) one inner-pairing product proof of size $O(\log(\sqrt{N}))$ and (b) one PST opening of the same size (vs $O(\log(N))$ before; constants matter 😉).
🚀 Faster Proving Times

🚀 Proving operates on $O(\sqrt{N})$ -sized polynomials now instead of $O(N)$ : so even if there are more polynomials, the proofs can be parallelised and are thus faster to compute. This holds in particular when it comes to the quotient polynomials which are smaller.
🚀 Large speed up for uniform computations: As an example, in Filecoin Proofs, we verify hundreds of Merkle tree openings. You can think of it as applying a subcircuit (”Verify one Merkle Tree opening proof”) many times over.

We now treat the “R1CS Matrix” as the matrix for the small subcircuit.
💡 In the sumcheck phase, we simply “concatenate” each witness (for each subcircuit) together. In theory, we should also concatenate the R1CS matrices together, but, since they are the same, the prover doesn’t have to do that in practice and we can leverage the specific details of the Spartan protocol here (a slightly more formal draft here)

Here is an example where we repeat the same circuit twice. Note that the prover only needs to commit and keep in memory the small blue matrix.

🚀 While the witness polynomial remains the same length, the computation commitment now only operates on the small circuit which is a significant speed-up

🚀 Fast Verification and Small Proof Size

🚀 Constant-time sumcheck verification: A sumcheck operates on finite fields only. Therefore, in Testudo, the prover encodes the verification of the sumcheck as a circuit and gives back a SNARK proof to the final verifier. Testudo uses Groth16 to implement this verifier, which still keeps the “universality” of Testudo because sumcheck verification doesn’t change with the user circuit.

🚀 Constant-time polynomial commitment opening verification: We can also apply a final proof system to compress the verification of the polynomial openings. These openings are of size $O(\log\sqrt{N})$ , which is small enough to run inside a circuit.

We call this the outer proof over the outer curve, see next section for more details.
🚀 This yields a constant size proof and constant time verifier!

Overall:

Testudo81 and Testudo77

Originally, we wanted to realize Testudo on BLS12-377 (hence, Testudo77) because of its nice 2-chain property. This enables us to use BW6 (the “outer curve”) to efficiently prove statements about elliptic curve operations natively. For example, to verify the PST opening inside a proof where one needs to do scalar multiplication and pairings.

However, Filecoin is operating on BLS12-381! That means, in order to introduce this proof system in Filecoin, we would require storage providers to re-encode their storage using the new curve, mostly because of Poseidon which is field-dependent (unlike SHA256 for example). Testudo81 is our version on Testudo that runs on BLS12-381. The main issue is that it is not possible to use Groth16 naively on top of BLS12-381 because any “outer curve” will lack high 2-adicity in its scalar field, required to compute FFTs efficiently (see Timofey’s post for more information about that).

The problem: How can we be backwards compatible, so that Filecoin storage providers don’t need to re-encode their storage?

Testudo77

This version is the simplest and most elegant solution if we can afford using these curves. Basically it runs an external Groth16 proof system on BW6 to verify both sumcheck and polynomial commitment openings.

🛗 Aggregation: An additional advantage is that we can use tools like Snarkpack to further aggregate Testudo77 proofs!

Outer proof constraints: We expect the number of constraints to be less than 10 millions which results in a few seconds of additional proving time using Groth16 (which is totally fine for our setting).

Testudo81

In this case since we cannot use Groth16 to compress the proof size and verification time, we could leave them in the clear for a $O(\log N)$ size proof. Alternatively we propose to use a subset of Testudo itself (i.e. the modified Spartan component which does not require FFT) to compress the polynomial commitment openings (using for example the Yeti curve as the “outer curve”). While this version achieves compatibility, it loses, however, in compactness and verification time—the “Spartan part” of Testudo is not constant-size (asymptotically this would be a $O(\log \log N)$ proof size/verification time, with some high constants).

Implementation

The work-in-progress implementation is open source on Github. Currently, it features:

The Groth16 verifier of the sumchecks
The square root version of PST + MIPP
Arkworks wrapper around the fast blst library with GPU integration (repo)

Benchmarks

Modified PST

One of our main contributions is a very fast implementation of a multilinear polynomial commitment, built on top of the existing arkworks library. We present a comparison between the two versions on a polynomial of size $2^{25}$

	Commitment (s)	Opening (s)	Verification (ms)	Proof (KB)	Committer Key
arkworks PST	34	184	14	2 KB	9.6 GB
testudo PST	24	1.2	32	17 KB	2.3 MB

Verification time: Note that this time has more than doubled. It is due to non optimized implementation. For example, the PST and MIPP verification are happening in isolation, i.e. the pairings are evaluated separately instead of together. Multiple other optimization have to be implemented to verify both parts together.

Estimation using uniform circuits

We have the necessary building blocks to estimate accurately the proving time of a uniform circuits (even though the implementation does not yet offer that feature).

Specifically, we need to add the time for

The first sumcheck on the full R1CS matrix (SC1)
The second sumcheck on the small subcircuit (SC2)
The PST times on the full witness size - commitment and opening combined (PST/MIPP)
The computation commitment time on the small subcircuit (CC)

We have all this for a subcircuit of size $2^{20}$ repeated $128 = 2^{7}$ times, giving a circuit of size $2^{27}$ constraints.

R1CS	PST/MIPP (s) Comm + Opening	SC1 (s)	SC2 (s)	CC (s)
$2^{27}$	47 + 3	105	107	3944
$2^{20}$	1.17 + 0.454	0.843	0.773	30

Total Proving Time: PST( $2^{27}$ ) + SC1( $2^{27}$ ) + SC2( $2^{20}$ ) + CC( $2^{20}$ ) = 185s

Comparison $2^{27}$ R1CS Bellperson: 1020s

Speedup factor: 5.1x 🚀

💡 As you can see, the CC is the most expensive part. There are many improvements to be done at this early stage that can drastically reduce the proving time there. We remind the reader that for uniform circuits this cost can be eliminated however.

Comparison with Plonk-ish techniques

When it comes to universal trusted setup proofs, many systems today do not use R1CS but rather “custom gates” (sometimes also called Plonkish arithmeization), and apply SNARKs such as Plonk (or alternatives such as Hyperplonk) to the resulting constraint systems. The use of “custom gates” makes a comparison to pure R1CS-based schemes not immediate. We are still working on achieving meaningful comparisons but we estimate that Testudo is competitive with approaches that do use custom gates.

Open problems

We have yet to solve still a few problems down the road:

☄️Efficient “streaming” aggregation: Snarkpack allows a prover to only keep Groth16 proofs locally and then aggregate them all at once. This is very light in memory, even for a large number of proofs.

For Testudo81, can we achieve the same functionality and performance as Snarkpack? See our draft for our current thinking and problem.
For Testudo77, can we achieve a faster aggregation? Instead of “fully” proving a Testudo proof, can we keep some small intermediate steps and finalize the last parts of Testudo over all proofs?

📚 Compact proof on Testudo81: Some open questions are:

How to achieve a compact proof using Testudo81?
Can we use another proof system, with non-native arithmetic in a practical way?
Can we use a mixed FFT algorithm to go around the 2-adicity problem?

🚀 Faster computation commitment and CRS for it: Currently, the computation commitment is still a bottleneck compared to the rest (when not using data parallelism). One area to explore is taking advantage of the PST commitment scheme here and batch it with the PST of the witness.

Different Polynomial Commitments: Another design option would be to choose Dory as it could simplify the implementation of the Computation Commitment. In general, being agnostic to the polynomial commitment would help to try different strategies.

Call for participations

We’re looking for enthusiastic engineers to help us push this effort forward. We believe that making this in the open is gonna give the best results. This is a complex piece of software and the structure behind is challenging.

There are both design and engineering challenges that are left open. On the implementation side, some of the items we would love to work on are:

⚙️ A R1CS circuit building layer: Basically bringing a “ConstraintSystem” that enables development of complex R1CS circuits. The original Spartan test codebase only created R1CS manually.
⛓️ Data-parallel Testudo that greatly reduces the proving time over uniform circuits. This can be a game changer for Merkle tree path verification for example.
🏎️ GPU-optimized schemes: For example, bringing the whole PST commitment to GPU at once, bringing the computation commitment to the GPU, and even look at parallel computation of the sumcheck on the GPU.
📦 A simplified, and more compact computation commitment (CC): By moving CC to use PST for example we can apply several optimizations combined with the PST commitment computed for the witness. We can also reduce the size of the preprocessing data required during proving time, which can easily attain TB for circuits of size $2^{30}$ .

If you want to help, please reach out on t he discord server of cryptonet or email us here! Feel free to discuss over Twitter.

Acknowledgements

We thank Srinath Setty for helpful pointers about the Spartan codebase that unlocked us many times!

Team

The Testudo effort was started inside the cryptonet team. The main Testudo team is composed of: Matteo Campanelli (cryptonet), Nicolas Gailly (cryptonet), Rosario Gennaro (cryptonet/CCNY), Philipp Jovanovic (UCL), Mara Mihali (formerly UCL/cryptonet), Justin Thaler (a16z/Georgetown).