ZKPROV: A Zero-Knowledge Approach to Dataset Provenance for Large Language Models
Mina Namazi, Alexander Nemecek, Erman Ayday
TL;DR
ZKPROV tackles the challenge of proving that an LLM’s outputs are derived from authenticated, relevant datasets without exposing sensitive data or model parameters. It introduces a privacy-preserving framework that cryptographically binds responses to dataset provenance using Reckle Trees, KZG commitments, and a recursive zero-knowledge proof system (HyperNova), enabling end-to-end verification with sublinear proof growth. The approach provides formal guarantees of dataset privacy, soundness, and transcript binding while achieving practical latency (under 3.3 seconds per query for models up to 8B parameters) in experiments on biomedical domain data. This work lays the groundwork for trustworthy AI in regulated sectors by coupling provenance attestations with zero-knowledge proofs and efficient cryptographic primitives.
Abstract
As large language models (LLMs) are used in sensitive fields, verifying their computational provenance without disclosing their training datasets poses a significant challenge, particularly in regulated sectors such as healthcare, which have strict requirements for dataset use. Traditional approaches either incur substantial computational cost to verify the entire training process or leak unauthorized information to the verifier. We therefore introduce ZKPROV, a novel cryptographic framework that allows users to verify that the LLM responding to their prompts was trained on datasets certified by the authorities that own them. It also ensures that the datasets' content is relevant to the users' queries without revealing sensitive information about the datasets or the model parameters. ZKPROV balances privacy and efficiency by binding training datasets, model parameters, and responses together, and by attaching zero-knowledge proofs to the LLM's responses to validate these claims. Our experimental results demonstrate sublinear scaling for generating and verifying these proofs, with end-to-end overhead under 3.3 seconds for models up to 8B parameters, presenting a practical solution for real-world applications. We also provide formal security guarantees, proving that our approach preserves dataset confidentiality while ensuring trustworthy dataset provenance.
