
MTU: The Multifunction Tree Unit for Accelerating Zero-Knowledge Proofs

Jianqiao Mo, Alhad Daftardar, Joey Ah-Kiow, Kaiyue Guo, Benedikt Bünz, Siddharth Garg, Brandon Reagen

TL;DR

This paper analyzes the binary-tree workloads central to zero-knowledge proofs and introduces MTU, a hardware accelerator for SumCheck and Merkle tree kernels. It compares BFS, DFS, and a hardware-friendly Hybrid Traversal, and reports that MTU delivers up to ~1478× speedup over CPU at DDR-level bandwidth and up to ~9440× at high bandwidth, while reducing memory traffic. The authors describe the MTU architecture in detail, including a DFS Accumulator and a scalable PE-based fabric, and carry out a design-space study across bandwidth, area, and power. The results offer practical guidance for building modular, SoC-friendly ZKP accelerators that handle large binary-tree workloads, with an emphasis on traversal-aware optimizations and compact, reusable accelerator building blocks.

Abstract

Zero-Knowledge Proofs (ZKPs) are critical for privacy-preserving techniques and verifiable computation. Many ZKP protocols rely on kernels such as the SumCheck protocol and Merkle Tree commitments to provide their key security properties. These kernels exhibit balanced binary tree computational patterns, which enable efficient hardware acceleration. Although prior work has investigated accelerating these kernels as part of an overarching ZKP protocol, exploiting this common tree pattern remains relatively underexplored. We conduct a systematic evaluation of these tree-based workloads under different traversal strategies, analyzing performance on multi-threaded CPUs and the Multifunction Tree Unit (MTU) hardware accelerator. We introduce a hardware-friendly Hybrid Traversal for binary trees that improves parallelism and scalability while significantly reducing memory traffic on hardware. Our results show that MTU achieves up to $1478\times$ speedup over CPU at DDR-level bandwidth and that our hybrid traversal outperforms breadth-first search by up to $3\times$. These findings offer practical guidance for designing efficient hardware accelerators for ZKP workloads with binary tree structures.
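To make the traversal strategies concrete, here is a minimal Python sketch that computes a Merkle root over a power-of-two number of leaves in three ways: level by level (BFS), with a stack (DFS), and by running DFS on equal-sized subtrees before combining their roots. SHA-256, the function names, and the partitioned variant are assumptions chosen for illustration; the sketch is not the paper's Hybrid Traversal or the MTU dataflow, only an indication of why traversal order changes how much intermediate state must be held.

```python
import hashlib

def h(left: bytes, right: bytes) -> bytes:
    # Assumed node function: SHA-256 over the concatenated children.
    return hashlib.sha256(left + right).digest()

def merkle_root_bfs(leaves):
    # BFS: materialize every level in full before moving up the tree.
    level = list(leaves)
    while len(level) > 1:
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_root_dfs(leaves):
    # DFS: stream leaves and merge equal-height nodes as soon as possible.
    # The stack never holds more than log2(n) + 1 entries.
    stack = []  # entries are (height, digest)
    for leaf in leaves:
        height, digest = 0, leaf
        while stack and stack[-1][0] == height:
            _, left = stack.pop()
            height, digest = height + 1, h(left, digest)
        stack.append((height, digest))
    assert len(stack) == 1, "leaf count must be a power of two"
    return stack[0][1]

def merkle_root_partitioned(leaves, num_subtrees):
    # Hypothetical partitioned variant: DFS inside each subtree (these calls
    # are independent and could run in parallel), then combine the roots.
    size = len(leaves) // num_subtrees
    roots = [merkle_root_dfs(leaves[i:i + size])
             for i in range(0, len(leaves), size)]
    return merkle_root_bfs(roots)
```

All three functions return the same root for the same leaves; they differ only in the order in which nodes are visited and in how many intermediate digests are live at once.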

Paper Structure

This paper contains 22 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Dataflow for computing Build MLE (output size $2^5$). The result level (blue) is indexed as Level 1. Partitioning into four subtrees is highlighted with red dashed boxes to illustrate parallel DFS processing.
  • Figure 2: Dataflow for computing the MLE evaluation workload (input size $2^4$). Final output is marked in blue. Partitioning into two subtrees is shown with red dashed boxes to illustrate parallel DFS processing.
  • Figure 3: MTU architecture with 8 PEs for Hybrid Traversal. Each PE supports modular arithmetic and hashing. The connections enable both forward and inverted binary tree computation. All PEs can directly output results from the accelerator.
  • Figure 4: MTU runtime across workloads of size $2^{20}$, grouped by traversal types. Each group shows hardware runtime under varying hardware bandwidths (GB/s) and PE counts. Note that DFS is parallelized by partitioning the binary tree into subtrees, resulting in non-contiguous input/output indices.
  • Figure 5: CPU performance across different workloads using BFS and DFS. Each workload is evaluated at size $2^{20}$.
  • ...and 2 more figures
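
The MLE evaluation workload in Figure 2 reduces a table of $2^n$ values to a single output, which is the standard way to evaluate a multilinear extension at a point. A minimal Python sketch of that reduction follows, assuming a level-by-level (BFS) fold, a Mersenne prime modulus $2^{61}-1$, and a convention in which the first coordinate of the point binds the least significant index bit. The modulus, the function name, and the variable ordering are illustrative assumptions, not details taken from the paper.

```python
P = 2**61 - 1  # assumed prime modulus, chosen only for illustration

def mle_evaluate(table, point, p=P):
    # Fold the table one tree level per coordinate of `point`.
    # Each parent is t[2i] + r * (t[2i+1] - t[2i]) mod p, that is,
    # (1 - r) * t[2i] + r * t[2i+1].
    assert len(table) == 1 << len(point), "table size must be 2^len(point)"
    level = [x % p for x in table]
    for r in point:
        level = [(level[i] + r * (level[i + 1] - level[i])) % p
                 for i in range(0, len(level), 2)]
    return level[0]

# Usage: at a Boolean point the result is the matching table entry.
# Point (1, 0, 1) with the least significant bit first selects index 5.
value = mle_evaluate([3, 1, 4, 1, 5, 9, 2, 6], [1, 0, 1])
```

Each fold halves the working set, so the computation has the shape of an inverted balanced binary tree with the single output at the root.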