Table of Contents
Fetching ...

From Evaluation to Enhancement: Large Language Models for Zero-Knowledge Proof Code Generation

Zhantong Xue, Pingchuan Ma, Zhaoyu Wang, Yuguang Zhou, Xiaoqin Zhang, Shuai Wang, Juergen Rahmel

TL;DR

The paper addresses the barrier of authoring zero-knowledge proof (ZKP) programs by systematically evaluating large language models (LLMs) on ZK code generation and introducing an augmentation framework. It presents ZK-Eval, a three-stage benchmark that probes language knowledge, algebraic primitive competence, and end-to-end generation across Circom and Noir, revealing strong surface-level language abilities but weak constraint-level reasoning. To bridge this gap, it introduces ZK-Coder, which uses a constraint sketch language (ZKSL), retrieval-grounded guidance, and an interactive refinement loop to produce correct, verifiable ZK programs; across Circom and Noir, ZK-Coder yields substantial gains in end-to-end performance (e.g., up to 87.85% vs. 20.29% baseline on Circom and up to 97.79% vs. 28.38% on Noir with GPT-o3). The results establish a principled basis for measuring and augmenting LLMs in ZK code generation, with strong generalization to harder benchmarks and production patterns, offering practical pathways to lower barriers for privacy-preserving computing. The work highlights both the potential of LLMs in constrained, domain-specific programming and the need for structured grounding and iterative repair to ensure soundness and completeness in ZK implementations.

Abstract

Zero-knowledge proofs (ZKPs) are increasingly deployed in domains such as privacy-preserving authentication, verifiable computation, and secure finance. However, authoring ZK programs remains challenging: unlike conventional software development, ZK programming manifests a fundamental paradigm shift from \textit{imperative computation} to \textit{declarative verification}. This process requires rigorous reasoning about finite field arithmetic and complex constraint systems (which is rare in common imperative languages), making it knowledge-intensive and error-prone. While large language models (LLMs) have demonstrated strong code generation capabilities in general-purpose languages, their effectiveness for ZK programming, where correctness hinges on both language mastery and constraint-level reasoning, remains unexplored. To address this gap, we propose \textsc{ZK-Eval}, a domain-specific evaluation pipeline that probes LLM capabilities on ZK programming at three levels: language knowledge, algebraic primitive competence, and end-to-end program generation. Our evaluation of four state-of-the-art LLMs reveals that while models demonstrate strong proficiency in language syntax, they struggle when implementing and composing algebraic primitives to specify correct constraint systems, frequently producing incorrect programs. Based on these insights, we introduce \textsc{ZK-Coder}, an agentic framework that augments LLMs with constraint sketching, guided retrieval, and interactive repair. Experiments with GPT-o3 on Circom and Noir show substantial gains, with success rates improving from 20.29\% to 87.85\% and from 28.38\% to 97.79\%, respectively. With \textsc{ZK-Eval} and \textsc{ZK-Coder}, we establish a new basis for systematically measuring and augmenting LLMs in ZK code generation to lower barriers for practitioners and advance privacy computing.

From Evaluation to Enhancement: Large Language Models for Zero-Knowledge Proof Code Generation

TL;DR

The paper addresses the barrier of authoring zero-knowledge proof (ZKP) programs by systematically evaluating large language models (LLMs) on ZK code generation and introducing an augmentation framework. It presents ZK-Eval, a three-stage benchmark that probes language knowledge, algebraic primitive competence, and end-to-end generation across Circom and Noir, revealing strong surface-level language abilities but weak constraint-level reasoning. To bridge this gap, it introduces ZK-Coder, which uses a constraint sketch language (ZKSL), retrieval-grounded guidance, and an interactive refinement loop to produce correct, verifiable ZK programs; across Circom and Noir, ZK-Coder yields substantial gains in end-to-end performance (e.g., up to 87.85% vs. 20.29% baseline on Circom and up to 97.79% vs. 28.38% on Noir with GPT-o3). The results establish a principled basis for measuring and augmenting LLMs in ZK code generation, with strong generalization to harder benchmarks and production patterns, offering practical pathways to lower barriers for privacy-preserving computing. The work highlights both the potential of LLMs in constrained, domain-specific programming and the need for structured grounding and iterative repair to ensure soundness and completeness in ZK implementations.

Abstract

Zero-knowledge proofs (ZKPs) are increasingly deployed in domains such as privacy-preserving authentication, verifiable computation, and secure finance. However, authoring ZK programs remains challenging: unlike conventional software development, ZK programming manifests a fundamental paradigm shift from \textit{imperative computation} to \textit{declarative verification}. This process requires rigorous reasoning about finite field arithmetic and complex constraint systems (which is rare in common imperative languages), making it knowledge-intensive and error-prone. While large language models (LLMs) have demonstrated strong code generation capabilities in general-purpose languages, their effectiveness for ZK programming, where correctness hinges on both language mastery and constraint-level reasoning, remains unexplored. To address this gap, we propose \textsc{ZK-Eval}, a domain-specific evaluation pipeline that probes LLM capabilities on ZK programming at three levels: language knowledge, algebraic primitive competence, and end-to-end program generation. Our evaluation of four state-of-the-art LLMs reveals that while models demonstrate strong proficiency in language syntax, they struggle when implementing and composing algebraic primitives to specify correct constraint systems, frequently producing incorrect programs. Based on these insights, we introduce \textsc{ZK-Coder}, an agentic framework that augments LLMs with constraint sketching, guided retrieval, and interactive repair. Experiments with GPT-o3 on Circom and Noir show substantial gains, with success rates improving from 20.29\% to 87.85\% and from 28.38\% to 97.79\%, respectively. With \textsc{ZK-Eval} and \textsc{ZK-Coder}, we establish a new basis for systematically measuring and augmenting LLMs in ZK code generation to lower barriers for practitioners and advance privacy computing.

Paper Structure

This paper contains 55 sections, 1 equation, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Illustrative Comparison Between the Process of Mainstream and ZK Programming.
  • Figure 1: Mainstream & ZK Program Comparison.
  • Figure 2: Overview of ZK-Eval and ZK-Coder.
  • Figure 3: The Three-Stage Evaluation Design of the ZK-Eval Benchmark.
  • Figure 4: Accuracy of LLMs and Human Experts on the MCQ Benchmark for ZK Language Knowledge.
  • ...and 7 more figures