DiFR: Inference Verification Despite Nondeterminism

Adam Karvonen; Daniel Reuter; Roy Rinberg; Luke Marks; Adrià Garriga-Alonso; Keri Warr

TL;DR

This work tackles the problem of verifying LLM inference despite nondeterminism. It introduces Token-DiFR, which conditions verification on a shared sampling seed so that token generation becomes nearly deterministic, and Activation-DiFR, which uses random orthogonal projections to compress activations into compact fingerprints for forward-pass verification. Token-DiFR requires no additional communication: the output tokens themselves serve as evidence of correctness, and it detects 4-bit quantization with AUC > 0.999 within 300 output tokens. Activation-DiFR reaches the same AUC using just 2 output tokens while reducing communication overhead by 25-75% relative to existing methods. The authors validate both methods empirically across multiple models and configurations, including misconfigurations and sampling bugs, and release an open-source vLLM integration to support practical deployment. They also discuss deployment considerations, suggesting a practical mix of detectors and advocating for standardized sampling implementations so that verification works across providers.
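To make the idea of seed-conditioned verification concrete, here is a minimal sketch. It assumes sampling is done with the Gumbel-max trick driven by a shared seed; the summary above does not say which sampler the authors use, and the function names (`gumbel_noise`, `sample_token`, `divergence_margin`), the seed-mixing rule, and the margin score are all hypothetical choices made for illustration.

```python
import math
import random


def gumbel_noise(seed, position, vocab_size):
    """Deterministic Gumbel(0, 1) noise for one decoding position.

    Assumption: provider and verifier derive the per-position RNG from
    (seed, position) in the same way. The mixing rule below is illustrative.
    """
    rng = random.Random(seed * 1_000_003 + position)
    # `or 1e-12` guards against the (rare) draw of exactly 0.0.
    return [-math.log(-math.log(rng.random() or 1e-12)) for _ in range(vocab_size)]


def sample_token(logits, seed, position, temperature=1.0):
    """Gumbel-max sampling: argmax of (logits / T + Gumbel noise)."""
    noise = gumbel_noise(seed, position, len(logits))
    scores = [l / temperature + g for l, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


def divergence_margin(ref_logits, claimed_token, seed, position, temperature=1.0):
    """Gap between the reference's best noisy score and the claimed token's.

    0.0 means the reference would have sampled the same token; a large gap
    means benign numerical noise is an unlikely explanation.
    """
    noise = gumbel_noise(seed, position, len(ref_logits))
    scores = [l / temperature + g for l, g in zip(ref_logits, noise)]
    return max(scores) - scores[claimed_token]


if __name__ == "__main__":
    provider_logits = [2.0, 1.0, 0.5, -1.0]
    token = sample_token(provider_logits, seed=7, position=0)
    # The verifier recomputes logits itself; small numerical drift is expected.
    verifier_logits = [l + 1e-4 for l in provider_logits]
    print(token, divergence_margin(verifier_logits, token, seed=7, position=0))
```

Because the noise is fixed by the seed, an honest provider's tokens should yield margins at or near zero, while a different model or sampler would tend to produce larger margins when aggregated over many tokens.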

Abstract

As demand for LLM inference grows, it is becoming increasingly important that providers and their customers can verify that inference processes are performed correctly, without errors or tampering. However, re-running the same inference process twice often leads to different results due to benign numerical noise, making it difficult to distinguish legitimate variation from actual problems. To address this problem, we introduce Token-DiFR (Token-Divergence-From-Reference), a method for verifying inference outputs by comparing generated tokens against predictions made by a trusted reference implementation conditioned on the same random seed. Sampling seed synchronization tightly constrains valid outputs, leaving providers minimal room to deviate from correct inference, which allows output tokens themselves to serve as auditable evidence of correctness at zero additional cost to the provider. Token-DiFR reliably identifies sampling errors, simulated bugs, and model quantization, detecting 4-bit quantization with AUC > 0.999 within 300 output tokens. For applications requiring sample-efficient forward-pass verification, we additionally introduce Activation-DiFR, a scheme that uses random orthogonal projections to compress activations into compact fingerprints for subsequent verification. Activation-DiFR detects 4-bit quantization with AUC > 0.999 using just 2 output tokens, while reducing communication overhead by 25-75% relative to existing methods. We release an open-source integration with vLLM to accelerate practical deployment of verifiable inference.
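The fingerprinting idea behind Activation-DiFR can be illustrated with a small sketch. It assumes the projection is a set of k orthonormal directions built from a shared seed by Gram-Schmidt, and that the verifier compares fingerprints by Euclidean distance; the abstract does not specify how the projection is generated, which activations are fingerprinted, or how fingerprints are scored, so every name and parameter below is hypothetical.

```python
import math
import random


def orthonormal_rows(seed, k, d):
    """Build k orthonormal vectors in R^d from a seed (Gram-Schmidt).

    Assumption: both parties derive the same projection from a shared seed.
    """
    if k > d:
        raise ValueError("k must not exceed the activation dimension d")
    rng = random.Random(seed)
    rows = []
    while len(rows) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove the components along the directions found so far.
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def fingerprint(activation, seed, k):
    """Compress a d-dimensional activation into k projected coordinates."""
    rows = orthonormal_rows(seed, k, len(activation))
    return [sum(a * b for a, b in zip(row, activation)) for row in rows]


def fingerprint_distance(fp_a, fp_b):
    """Euclidean distance between two fingerprints."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)))


if __name__ == "__main__":
    provider_act = [0.1 * i for i in range(16)]
    verifier_act = [a + 1e-4 for a in provider_act]  # benign numerical drift
    fp_provider = fingerprint(provider_act, seed=42, k=4)
    fp_verifier = fingerprint(verifier_act, seed=42, k=4)
    print(fingerprint_distance(fp_provider, fp_verifier))
```

Since the rows are orthonormal, the distance between two fingerprints never exceeds the distance between the underlying activations, so small numerical drift stays small after compression while the provider only transmits k numbers instead of d.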
