MTU: The Multifunction Tree Unit for Accelerating Zero-Knowledge Proofs
Jianqiao Mo, Alhad Daftardar, Joey Ah-Kiow, Kaiyue Guo, Benedikt Bünz, Siddharth Garg, Brandon Reagen
TL;DR
This paper analyzes the binary-tree workloads central to zero-knowledge proofs and introduces MTU, a hardware accelerator for the SumCheck and Merkle-tree kernels. It compares breadth-first (BFS), depth-first (DFS), and a hardware-friendly Hybrid Traversal, and shows that MTU delivers speedups over a CPU of up to 1478× at DDR-level bandwidth and up to roughly 9440× at high bandwidth while reducing memory traffic. The authors detail the MTU architecture, including a DFS Accumulator and a scalable fabric of processing elements (PEs), and carry out a design-space study across bandwidth, area, and power. The results offer practical guidance for building modular, SoC-friendly ZKP accelerators that handle large binary-tree workloads, including those arising in polynomial commitments. The work advances hardware-software co-design for ZKPs by highlighting traversal-aware optimizations and compact, reusable accelerator building blocks.
Abstract
Zero-Knowledge Proofs (ZKPs) are critical for privacy-preserving techniques and verifiable computation. Many ZKP protocols rely on kernels such as the SumCheck protocol and Merkle Tree commitments to achieve their core security properties. These kernels exhibit balanced binary tree computational patterns, which lend themselves to efficient hardware acceleration. Although prior work has investigated accelerating these kernels as part of an overarching ZKP protocol, exploiting this common tree pattern remains relatively underexplored. We conduct a systematic evaluation of these tree-based workloads under different traversal strategies, analyzing performance on multi-threaded CPUs and on the Multifunction Tree Unit (MTU), a hardware accelerator. We introduce a hardware-friendly Hybrid Traversal for binary trees that improves parallelism and scalability while significantly reducing memory traffic on hardware. Our results show that MTU achieves up to $1478\times$ speedup over CPU at DDR-level bandwidth and that our Hybrid Traversal outperforms breadth-first search by up to $3\times$. These findings offer practical guidance for designing efficient hardware accelerators for ZKP workloads with binary tree structures.
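To make the traversal trade-off concrete, the following minimal Python sketch computes a Merkle root over a balanced binary tree in two ways: level by level (BFS) and with a streaming stack (DFS). It is an illustration only and not the MTU design: the hash function (SHA-256), the helper names `hash_pair`, `merkle_root_bfs`, and `merkle_root_dfs`, and the power-of-two leaf count are all assumptions made here, and the paper's Hybrid Traversal is not reproduced.

```python
import hashlib


def hash_pair(left: bytes, right: bytes) -> bytes:
    """Combine two child digests into a parent digest (SHA-256 is an assumption)."""
    return hashlib.sha256(left + right).digest()


def merkle_root_bfs(leaves):
    """Breadth-first: materialize each full level before moving up.

    Every intermediate level is stored, so the working set is O(n).
    """
    level = list(leaves)
    while len(level) > 1:
        level = [hash_pair(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def merkle_root_dfs(leaves):
    """Depth-first: stream leaves and merge equal-height subtrees on a stack.

    Returns the root and the peak stack size, which stays logarithmic in n.
    """
    stack = []  # entries are (height, digest)
    peak = 0
    for leaf in leaves:
        height, digest = 0, leaf
        # Merge while the top of the stack is a left sibling of equal height.
        while stack and stack[-1][0] == height:
            _, left = stack.pop()
            height, digest = height + 1, hash_pair(left, digest)
        stack.append((height, digest))
        peak = max(peak, len(stack))
    assert len(stack) == 1, "sketch assumes a power-of-two number of leaves"
    return stack[0][1], peak


# Hypothetical input: 8 leaf digests derived from single bytes.
leaves = [hashlib.sha256(bytes([i])).digest() for i in range(8)]
root_bfs = merkle_root_bfs(leaves)
root_dfs, peak_stack = merkle_root_dfs(leaves)
```

Both functions return the same root. The BFS version exposes wide parallelism within each level but writes every intermediate level back to memory, whereas the DFS version keeps only a logarithmic number of partial results live at the cost of a more sequential schedule. This is the tension that a hybrid scheme aims to balance.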
