An article to understand the advantages and disadvantages of FPGA and GPU accelerated zero-knowledge proof computing

This article analyzes the MSM algorithm, elliptic curve point addition algorithm, Montgomery multiplication algorithm, etc. in detail, and compares the performance difference between GPU and FPGA in BLS12_381 curve point addition.

Written by: Star Li

Zero-knowledge proof technology is more and more widely used, such as privacy proof, calculation proof, consensus proof and so on. While looking for more and better application scenarios, many people have gradually discovered that zero-knowledge proof proves that performance is a bottleneck. The Trapdoor Tech team has been deeply researching zero-knowledge proof technology since 2019, and has been exploring efficient zero-knowledge proof acceleration solutions. GPU or FPGA are relatively common acceleration platforms currently on the market. This article starts with the calculation of MSM, and analyzes the advantages and disadvantages of FPGA and GPU accelerated zero-knowledge proof calculation.

TL;DR

ZKP is a technology with broad prospects for the future. More and more applications are adopting zero-knowledge proof technology. However, there are many ZKP algorithms, and various projects use different ZKP algorithms. At the same time, the computational performance of the ZKP proof is relatively poor. This paper analyzes the MSM algorithm, elliptic curve point addition algorithm, Montgomery multiplication algorithm, etc. in detail, and compares the performance difference between GPU and FPGA in BLS12_381 curve point addition. In general, in terms of ZKP proof computing, short-term GPU has obvious advantages, high throughput, high cost performance, programmability and so on. Relatively speaking, FPGA has certain advantages in power consumption. In the long run, there may be FPGA chips suitable for ZKP calculations, or ASIC chips customized for ZKP.

ZKP algorithm complex

ZKP is a general term for Zero Knowledge Proof technology (Zero Knowledge Proof). There are mainly two classifications: zk-SNARK and zk-STARK. The current common algorithms of zk-SNARK are Groth16, PLONK, PLOOKUP, Marlin and Halo/Halo2. The iteration of the zk-SNARK algorithm is mainly along two directions: 1/whether a trusted setup is needed 2/the performance of the circuit structure. The advantage of the zk-STARK algorithm is that no trusted setup is required, but the amount of verification calculation is log-linear.

As far as the application of the zk-SNARK/zk-STARK algorithm is concerned, the zero-knowledge proof algorithms used by different projects are relatively scattered. In the application of zk-SNARK algorithm, because the PLONK/Halo2 algorithm is universal (no need for trusted setup), there may be more and more applications.

PLONK proves the amount of computation

Taking the PLONK algorithm as an example, analyze the calculation amount of the PLONK proof.

The calculation amount of the PLONK proof part consists of four parts:

1/ MSM - Multiple Scalar Multiplication. MSM is often used to compute polynomial commitments.

2/ NTT Computation - Polynomial conversion between point value and coefficient representation.

3/ Polynomial calculation - polynomial addition, subtraction, multiplication and division. Polynomial evaluation (uation) and so on.

4/ Circuit Synthesize - circuit synthesis. The calculation of this part is related to the scale/complexity of the circuit.

Generally speaking, the amount of calculation in the Circuit Synthesize part is more judgment and loop logic, and the degree of parallelism is relatively low, which is more suitable for CPU calculation. Generally speaking, zero-knowledge proof acceleration generally refers to the calculation acceleration of the first three parts. Among them, the calculation amount of MSM is relatively the largest, followed by NTT.

What's MSM

MSM (Multiple Scalar Multiplication) refers to a given series of points and scalars on the elliptic curve, and calculates the points corresponding to the results of adding these points.

For example, given a series of points on an elliptic curve:

Given a fixed set of Elliptic curve points from one specified curve:

[G_1, G_2, G_3, ..., G_n]

and random coefficients:

and a randomly sampled finite field elements from specified scalar field:

[s_1, s_2, s_3, ..., s_n]

MSM is the calculation to get the Elliptic curve point Q:

Q = \sum_{i=1}^{n}s_i*G_i

The industry generally adopts the Pippenger algorithm to optimize the MSM calculation. Take a closer look at the schematic diagram of the process of Pippenger's algorithm:

The calculation process of the Pippenger algorithm is divided into two steps:

1/ Scalar split into Windows. If Scalar is 256bits, and a Window is 8bits, then all Scalars are divided into 256/8=32 Windows. Each layer of Window uses a "Buckets" to temporarily store intermediate results. GW_x is the point of accumulation result on one layer. Calculating GW_x is also relatively simple. It traverses each Scalar in a layer in turn, and uses the value of the Scalar layer as Index, and adds the corresponding G_x to the corresponding Buckets. In fact, the principle is relatively simple. If the coefficients of the addition of two points are the same, add the two points first and then add the Scalar again, instead of adding the two points twice before adding the Scalar.

2/ The points calculated by each Window are accumulated by double-add to get the final result.

Pippenger's algorithm also has many deformation optimization algorithms. In any case, the underlying calculation of the MSM algorithm is the addition of points on the elliptic curve. Different optimization algorithms correspond to different numbers of points added.

Elliptic curve point addition (Point Add)

You can look at various algorithms for point addition on elliptic curves with "short Weierstrass" form from this site.

Assuming that the Projective coordinates of the two points are (x1, y1, z1) and (x2, y2, z2) respectively, the result of point addition (x3, y3, z3) can be calculated by the following calculation formula.

The reason for giving the calculation process in detail is to show that the entire calculation process is mostly integer operations. The bit width of the integer depends on the parameters of the elliptic curve. Give the bit widths of some common elliptic curves:

  • BN256 - 256bits
  • BLS12_381 - 381bits
  • BLS12_377 - 377bits
  • Special attention is that these integer operations are operations on the modulo field. Modular addition/modulus subtraction is relatively simple, focus on the principle and implementation of modular multiplication.

模乘(Modular Muliplication)

Given two values over a modulo field: x and y. The modular multiplication calculation refers to x*y mod p. Note that the bit width of these integers is the bit width of the elliptic curve. The classic algorithm of modular multiplication is Montgomery multiplication (MontgomeryMulplication). Before performing Montgomery multiplication, the multiplicand needs to be converted to Montgomery representation:

The formula for calculating Montgomery multiplication is as follows:

There are many Montgomery multiplication implementation algorithms: CIOS (Coarsely Integrated Operand Scanning), FIOS (Finely Integrated Operand Scanning), and FIPS (Finely Integrated Product Scanning) and so on. This article does not introduce the details of various algorithm implementations in depth, and interested readers can do their own research.

In order to compare the performance difference between FPGA and GPU, choose the most basic algorithm implementation method:

Simply put, the modular multiplication algorithm can be further divided into two calculations: large number multiplication and large number addition. After understanding the calculation logic of MSM, you can choose the performance of modular multiplication (Throughput) to compare the performance of FPGA and GPU.

Observe and think

Under such FPGA design, it can be estimated that the entire VU9P can provide the throughput at BLS12_381 elliptic curve points. A point addition (add_mix way) requires about 12 modular multiplications. The system clock of FPGA is 450M.

Under the same modular multiplication/modulus addition algorithm, using the same point addition algorithm, the point plus Troughput of Nvidia 3090 (considering data transmission factors) exceeds 500M/s. Of course, the whole calculation involves a variety of algorithms, some algorithms may be suitable for FPGA, and some algorithms are suitable for GPU. The reason for using the same algorithm comparison is to compare the core computing capabilities of FPGA and GPU.

Based on the above results, summarize the comparison between GPU and FPGA in terms of ZKP proof performance:

Summarize

More and more applications are adopting zero-knowledge proof technology. However, there are many ZKP algorithms, and various projects use different ZKP algorithms. From our hands-on engineering experience, FPGA is an option, but GPU is currently a cost-effective option. FPGA prefers deterministic computing, which has the advantages of latency and power consumption. GPU has high programmability, has a relatively mature high-performance computing framework, and has a short development iteration cycle, and prefers throughput scenarios.

View Original
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • Comment
  • Repost
  • Share
Comment
0/400
No comments
Trade Crypto Anywhere Anytime
qrCode
Scan to download Gate app
Community
English
  • 简体中文
  • English
  • Tiếng Việt
  • 繁體中文
  • Español
  • Русский
  • Français (Afrique)
  • Português (Portugal)
  • Bahasa Indonesia
  • 日本語
  • بالعربية
  • Українська
  • Português (Brasil)