zkMIPS Beta: A Competitive Performance Report

Overview and Rationale

In this work, we begin publishing a general and fair zkVM benchmark framework based on previous work by a16z, comparing proving time and energy cost between ZKM (zkMIPS) and other zkVM projects such as RISC Zero (R0) and SP1.

This initial publication focuses on comparing proving time with R0. The results show that zkMIPS's proving time ranges from 76% to 317% of R0's, and we also analyze the slow cases and describe upcoming optimizations.

We aim for zkMIPS to become one of the most production-ready zkVMs on the market and the most performant zkVM built on the MIPS instruction set.

Metrics

We seek to investigate the following questions:

  1. General-Purpose VM Use Cases: Evaluating which proof system is best for general-purpose VM use.
  2. Faster Proofs: Comparing ZKM vs R0 to determine which implementation offers faster proof generation.

Methodology

Note: To ensure the results are directly comparable and fair, we implemented our benchmark suite (https://github.com/zkMIPS/zkvm-benchmarks) based on a16z/zkvm-benchmark, fixing the instance types and workloads and updating R0 to the latest v1.0.5.

For CPU-only machines, the zkVM can still use the AVX instruction set to speed up Goldilocks field operations. Based on our previous experience, this yields a 6-10% speedup. We enabled this feature during the benchmark by compiling with RUSTFLAGS="-C target-cpu=native".
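To make the target of that optimization concrete, here is a minimal sketch of arithmetic in the Goldilocks field (p = 2^64 - 2^32 + 1), the field Plonky2's FFT and Poseidon run over. This naive u128-modulo version is for clarity only; production provers use a branch-free reduction that the compiler can vectorize under `-C target-cpu=native`.

```rust
// Goldilocks prime: p = 2^64 - 2^32 + 1.
const GOLDILOCKS_P: u64 = 0xFFFF_FFFF_0000_0001;

// Naive field multiplication via u128 modulo. AVX-accelerated provers
// replace this with a specialized reduction exploiting p's structure.
fn gl_mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % GOLDILOCKS_P as u128) as u64
}

// Naive field addition via u128 modulo.
fn gl_add(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % GOLDILOCKS_P as u128) as u64
}

fn main() {
    // (p - 1) is -1 in the field, so (-1) * (-1) = 1.
    println!("{}", gl_mul(GOLDILOCKS_P - 1, GOLDILOCKS_P - 1));
}
```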

Benchmarks

All benchmarks can be found at https://github.com/zkMIPS/zkvm-benchmarks

  • sha2 and sha2-chain: sha2 digests inputs of different sizes; sha2-chain runs the sha2 function n times on a fixed-size (32-byte) input.
  • sha3 and sha3-chain: sha3 digests inputs of different sizes; sha3-chain runs the sha3 function n times on a fixed-size (32-byte) input.
  • Fibonacci: classic Fibonacci sequence calculation for different lengths.
  • bigmem: allocates a 128,000-byte array and uses black_box to prevent the compiler's dead-code-elimination optimization.
  • Rust EVM: a Rust EVM implementation by Bluealloy at commit bef9edd (https://github.com/zkMIPS/zkm/tree/main/prover/examples/revme).
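The chained workloads above share one shape: feed a fixed-size digest back into the hash n times. A minimal sketch of that structure follows; `toy_hash` is a hypothetical stand-in for SHA-256 (the real benchmarks hash with the `sha2`/`sha3` crates inside the guest program).

```rust
// Toy 32-byte "hash" standing in for SHA-256 in the real benchmark.
fn toy_hash(input: &[u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, b) in input.iter().enumerate() {
        out[i] = b.wrapping_mul(31).wrapping_add(i as u8 ^ 0xA5);
    }
    out
}

// The sha2-chain / sha3-chain workload shape: iterate the hash n times
// on a fixed-size (32-byte) input.
fn hash_chain(seed: [u8; 32], n: usize) -> [u8; 32] {
    let mut digest = seed;
    for _ in 0..n {
        digest = toy_hash(&digest);
    }
    digest
}

fn main() {
    let out = hash_chain([0u8; 32], 1000);
    println!("{:02x?}", &out[..4]);
}
```

Because each iteration depends on the previous digest, the chain benchmarks stress sequential in-guest compute rather than input I/O.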

Environment

CPU instance: AWS r6a.8xlarge, 32 vCPUs, 256 GB RAM, AMD EPYC 7R13 processor

GPU instance: 64 vCPUs, 480 GB RAM, AMD EPYC 9354 32-core processor, 4x NVIDIA GeForce RTX 4090

CPU Instance

The segment size of zkMIPS is 262144 (2^18).
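The segment size matters because the prover splits a long execution into fixed-size segments and proves each independently; the segment count is a simple ceiling division. A sketch, assuming a 262144-cycle segment:

```rust
// Cycles per segment: 262144 = 2^18.
const SEGMENT_SIZE: u64 = 1 << 18;

// Number of segment proofs a run of `total_cycles` cycles requires
// (ceiling division: a partial final segment still needs a proof).
fn segment_count(total_cycles: u64) -> u64 {
    (total_cycles + SEGMENT_SIZE - 1) / SEGMENT_SIZE
}

fn main() {
    println!("{}", segment_count(1_000_000));
}
```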

SHA2:

SHA2-chain:

SHA3:

SHA3-chain:

bigmem:

Fibonacci:

Summary

From this comparison, we can see that zkMIPS is competitive with other top-tier zkVMs in both proving performance and energy cost.

Furthermore, the analysis of our time distribution over the proof generation is depicted in Figure 1:

Figure 1: time/cost distribution for a single segment

Observing the different segments, we found that generating the traces takes about 10%-22% of the total time and computing the commitment of each table takes 25%-32%, while the CTL (cross-table lookup, built on a GKR-optimized LogUp scheme) time is negligible.

By design, trace generation runs on a single CPU core, but the commitment calculations for the different tables (CPU, Arithmetic, etc.) can be executed in parallel across multiple cores, which cuts that portion of the time by about two thirds.
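The per-table parallelism described above can be sketched with scoped threads: each table's commitment is independent, so one thread per table suffices. `commit` here is a hypothetical stand-in for the real Merkle-tree commitment.

```rust
use std::thread;

// Hypothetical stand-in for committing to one trace table
// (the real prover builds a Poseidon Merkle tree per table).
fn commit(table: &[u64]) -> u64 {
    table.iter().fold(0u64, |acc, &x| {
        acc.wrapping_mul(0x9E37_79B9_7F4A_7C15).wrapping_add(x)
    })
}

// Commit to every table in parallel; tables (CPU, Arithmetic, Memory,
// ...) are independent, so scoped threads can work on them at once.
fn commit_all_parallel(tables: &[Vec<u64>]) -> Vec<u64> {
    thread::scope(|s| {
        let handles: Vec<_> = tables
            .iter()
            .map(|t| s.spawn(move || commit(t)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let tables = vec![vec![1, 2, 3], vec![4, 5], vec![6]];
    println!("{:?}", commit_all_parallel(&tables));
}
```

Trace generation itself stays sequential (each cycle depends on the previous one), which is why only the commitment phase benefits from this fan-out.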

The proof generation takes up to about 45% of the total. For the proof calculation itself, we have a detailed pie chart in Figure 2 below: the memory and CPU operations take about 72.8% of the total time, which matches the sizes of their trace tables.

Figure 2: time/cost distribution for a proof calculation

For each STARK proof generation, the time usage distribution is shown in Tables 4, 5, and 6 below. It is easy to observe that 'compute auxiliary polynomials commitment' and 'compute openings proof' are the most time-consuming steps. 'Compute auxiliary polynomials commitment' includes the Poseidon-based Merkle hashing and the FFT that converts polynomials from coefficient form to point-value form. 'Compute openings proof' evaluates the final polynomial at the opening points via polynomial multiplication and division (inversion) operations.

Table 4: time usage for computing single STARK proof

Compute single STARK proof                  Time used (s)   Ratio
Compute lookup helper columns               2.562           3.38%
Compute auxiliary polynomials commitment    16.8139         22.21%
Compute quotient polys                      7.0605          9.33%
Split quotient polys                        0.1736          0.23%
Compute quotient commitment                 10.5241         13.90%
Compute openings proof                      38.5683         50.95%


Table 5: time usage for computing the auxiliary polynomials commitment

Compute auxiliary polynomials commitment    Time used (s)   Ratio
IFFT                                        0.9216          5.68%
FFT + blinding                              3.6157          22.28%
Transpose LDEs                              0.2759          1.70%
Build Merkle tree                           11.4121         70.34%
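The 'Build Merkle tree' row dominates Table 5, so it is worth seeing the shape of that step: hash leaves pairwise, level by level, up to a single root. A minimal sketch follows; `hash_pair` is a toy stand-in for the Poseidon compression function Plonky2 actually uses.

```rust
// Toy two-to-one hash standing in for Poseidon compression.
fn hash_pair(l: u64, r: u64) -> u64 {
    l.wrapping_mul(0x100_0000_01B3)
        .wrapping_add(r)
        .rotate_left(13)
        ^ 0xCBF2_9CE4_8422_2325
}

// Build a Merkle root over a power-of-two number of leaves by hashing
// adjacent pairs until one node remains. The real prover hashes every
// level, and this is where most of Table 5's time goes.
fn merkle_root(leaves: &[u64]) -> u64 {
    assert!(leaves.len().is_power_of_two());
    let mut level = leaves.to_vec();
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| hash_pair(pair[0], pair[1]))
            .collect();
    }
    level[0]
}

fn main() {
    println!("{:x}", merkle_root(&[1, 2, 3, 4]));
}
```

Each level is data-parallel, which is one reason the GPU port pays off on this step.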


Table 6: time usage for computing the openings proof

Compute openings proof                      Time used (s)   Ratio
Reduce batch                                24.145          66.06%
Perform final FFT                           8.72            23.86%
Fold codewords in the commitment phase      3.6842          10.08%
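The 'Reduce batch' step that dominates Table 6 combines many committed polynomials into a single one using powers of a random challenge, so that one FRI proof covers all the openings. A sketch of that random linear combination over Goldilocks, with naive u128 reduction for clarity:

```rust
// Goldilocks prime: p = 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

fn add(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % P as u128) as u64
}

// Batch reduction: given polynomials p_0..p_{k-1} (as coefficient
// vectors) and a challenge alpha, compute sum_i alpha^i * p_i
// coefficient-wise. One FRI proof on the result then attests to
// every polynomial in the batch.
fn reduce_batch(polys: &[Vec<u64>], alpha: u64) -> Vec<u64> {
    let len = polys.iter().map(|p| p.len()).max().unwrap_or(0);
    let mut acc = vec![0u64; len];
    let mut power = 1u64; // alpha^i, starting at alpha^0
    for poly in polys {
        for (j, &c) in poly.iter().enumerate() {
            acc[j] = add(acc[j], mul(power, c));
        }
        power = mul(power, alpha);
    }
    acc
}

fn main() {
    // Combine x and 1 with alpha = 2: result is 2 + x.
    println!("{:?}", reduce_batch(&[vec![0, 1], vec![1]], 2));
}
```

The cost of this step scales with the number and degree of the committed polynomials, which is why it is the single largest line item in the openings proof.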

Closing

Unlike ZKM's use of Plonky2, in which the FFT and Poseidon are performed over the Goldilocks field, RISC0 and SP1 use a more efficient hash function and a smaller field, which greatly benefits proving time. We have implemented a GPU version of Plonky2 and realized a 3x speedup. Meanwhile, we are looking to integrate a more efficient hash function and a smaller field into Plonky2 in place, which we expect to reduce time and cost by roughly half.

With the planned optimizations, we're confident that we can significantly increase the performance of zkMIPS and close the gap with other leading zkVMs.

We have strived to be as accurate as possible in these benchmark comparisons. If any discrepancies are identified, please reach out to us at contact@zkm.io, and we will address any necessary corrections.
