DiFR: Inference Verification Despite Nondeterminism

Adam Karvonen, Daniel Reuter, Roy Rinberg, Luke Marks, Adrià Garriga-Alonso, Keri Warr

TL;DR

This work addresses the problem of verifying LLM inference despite nondeterminism. It introduces Token-DiFR, which conditions verification on a shared sampling seed so that valid outputs become nearly deterministic, and Activation-DiFR, which uses random orthogonal projections to compress activations into compact fingerprints for forward-pass verification. Token-DiFR requires no additional communication: the output tokens themselves serve as evidence of correctness, and it detects 4-bit quantization with AUC $>$ 0.999 within 300 output tokens. Activation-DiFR reaches the same AUC using just 2 output tokens while reducing communication overhead by 25-75% relative to existing methods. The authors validate both methods empirically across multiple models and configurations, including misconfigurations and simulated sampling bugs, and release an open-source vLLM integration to support practical deployment. They also discuss deployment considerations, suggesting a practical mix of detectors and advocating standardized sampling implementations to ease verification across providers. Together, Token-DiFR and Activation-DiFR support verification of open-weight model inference as inference services become more widespread.

Abstract

As demand for LLM inference grows, it is becoming increasingly important that providers and their customers can verify that inference processes are performed correctly, without errors or tampering. However, re-running the same inference process twice often leads to different results due to benign numerical noise, making it difficult to distinguish legitimate variation from actual problems. To address this problem, we introduce Token-DiFR (Token-Divergence-From-Reference), a method for verifying inference outputs by comparing generated tokens against predictions made by a trusted reference implementation conditioned on the same random seed. Sampling seed synchronization tightly constrains valid outputs, leaving providers minimal room to deviate from correct inference, which allows output tokens themselves to serve as auditable evidence of correctness at zero additional cost to the provider. Token-DiFR reliably identifies sampling errors, simulated bugs, and model quantization, detecting 4-bit quantization with AUC $>$ 0.999 within 300 output tokens. For applications requiring sample-efficient forward-pass verification, we additionally introduce Activation-DiFR, a scheme that uses random orthogonal projections to compress activations into compact fingerprints for subsequent verification. Activation-DiFR detects 4-bit quantization with AUC $>$ 0.999 using just 2 output tokens, while reducing communication overhead by 25-75% relative to existing methods. We release an open-source integration with vLLM to accelerate practical deployment of verifiable inference.
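
The seed-conditioned check described in the abstract can be illustrated with a minimal sketch. It assumes Gumbel-max sampling (argmax of logits plus seeded Gumbel noise) and a margin-style score; the abstract specifies neither, and the names `gumbel_noise`, `sample_token`, and `margin_score` are hypothetical rather than taken from the released vLLM integration.

```python
import math
import random


def gumbel_noise(seed, position, vocab_size):
    """Gumbel(0, 1) noise for one position, derived only from the shared seed."""
    rng = random.Random(seed * 1_000_003 + position)  # assumed seeding scheme
    # -log(-log(u)) for u in (0, 1); the clamp guards against u == 0.
    return [-math.log(-math.log(max(rng.random(), 1e-12)))
            for _ in range(vocab_size)]


def perturbed_logits(logits, seed, position, temperature=1.0):
    noise = gumbel_noise(seed, position, len(logits))
    return [l / temperature + g for l, g in zip(logits, noise)]


def sample_token(logits, seed, position, temperature=1.0):
    """Gumbel-max sampling: the argmax of the noise-perturbed logits."""
    scores = perturbed_logits(logits, seed, position, temperature)
    return max(range(len(scores)), key=scores.__getitem__)


def margin_score(ref_logits, claimed_token, seed, position, temperature=1.0):
    """Gap between the reference's best perturbed logit and the claimed token's.

    Zero means the claimed token is exactly what the reference would sample.
    """
    scores = perturbed_logits(ref_logits, seed, position, temperature)
    return max(scores) - scores[claimed_token]


# Toy setting: the same 4-token logits at every position (illustrative only).
ref_logits = [2.0, 1.5, 0.3, -1.0]
seed = 1234
positions = range(200)

# An honest provider samples with the agreed seed.
honest_tokens = [sample_token(ref_logits, seed, t) for t in positions]
honest_scores = [margin_score(ref_logits, tok, seed, t)
                 for t, tok in enumerate(honest_tokens)]

# A misconfigured provider samples with a different seed.
wrong_tokens = [sample_token(ref_logits, seed + 1, t) for t in positions]
wrong_scores = [margin_score(ref_logits, tok, seed, t)
                for t, tok in enumerate(wrong_tokens)]

print("mean honest margin:", sum(honest_scores) / len(honest_scores))
print("mean wrong-seed margin:", sum(wrong_scores) / len(wrong_scores))
```

In this toy setup the honest margins are exactly zero because provider and verifier run identical arithmetic. In a real deployment, benign numerical noise would produce occasional small nonzero margins, which is why a detector would pool scores over many tokens instead of rejecting on a single mismatch.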

Paper Structure

This paper contains 55 sections, 11 equations, 15 figures, 8 tables, 4 algorithms.

Figures (15)

  • Figure 1: Token-DiFR accurately detects major misconfigurations even under GPU-mismatched deployments. We plot batch-level detection performance (AUC at 1% FPR) as a function of the number of tokens when distinguishing the reference configuration from four misconfigurations (4-bit model quantization, FP8 KV-cache quantization, incorrect sampling seed, and temperature 1.1 vs. 1.0). For Llama 3.1 8B (left), Token-DiFR reliably detects all misconfigurations within a few thousand tokens. For Qwen3-30B-A3B (right), it still cleanly detects large deviations, but the smallest change (temperature +0.1) is harder to separate from the pooled honest baseline because benign differences between A100 and H200 deployments broaden the null distribution. In matched settings, where the verifier and provider share the same H200 setup, Qwen3-30B-A3B performs about as well as Llama 3.1 8B (see the appendix on Qwen3-30B-A3B results).
  • Figure 2: Cross-entropy is vulnerable to simple adversarial manipulation. Using Llama 3.1 8B, we consider 4-bit model quantization and FP8 KV-cache quantization and tune the sampling temperature of the misconfigured model until its mean cross-entropy matches the reference configuration. Under this attack, cross-entropy-based detectors fall to near-chance performance, while Token-DiFR maintains high detection accuracy across batch sizes.
  • Figure 3: Detection of simulated sampling bugs for Llama 3.1 8B and Qwen3-30B-A3B. We introduce a bug that, with probability 1% per token, ignores the model logits and instead samples uniformly from the top-$k$ tokens ($k \in \{2, 32\}$), and otherwise samples correctly. The curves show AUC at 1% FPR as a function of batch size for cross-entropy and Token-DiFR variants. For Llama 3.1 8B, Token-DiFR detects both bug settings with modest batch sizes. For Qwen3-30B-A3B, the simple mean-pooled margin score underperforms cross-entropy in the $k=32$ case and fails to separate $k=2$ bugs from the pooled honest baseline. We pool per-token scores by taking their mean for consistency with our other figures. In the appendix on rare-bug pooling, however, we show that simple alternative aggregation schemes that emphasize rare large deviations restore strong Token-DiFR performance in this regime, and we recommend that practitioners consider monitoring multiple aggregation strategies in parallel.
  • Figure 4: Activation-DiFR detects quantization with small batches and outperforms TOPLOC at fixed communication cost. We plot AUC at $<$ 1% FPR as a function of batch size when distinguishing the reference configuration from 4-bit model quantization and FP8 KV-cache quantization, using activation fingerprints with projection dimension $k \in \{2, 8, 32\}$. Across both Llama 3.1 8B and Qwen3-30B-A3B, Activation-DiFR reaches near-saturated detection with small batches (for example, FP8 KV-cache quantization on Llama 3.1 8B is detected with roughly 4 tokens at $k=8$) and, at any fixed $k$, matches or exceeds TOPLOC’s detection accuracy. In the mismatched Qwen3-30B-A3B setting, classification still eventually saturates but requires larger batch sizes, similar to the increased difficulty observed for Token-DiFR under pooled honest environments.
  • Figure 5: Token-DiFR and cross-entropy can audit in-the-wild Llama 3.1 8B deployments at temperature zero. Low scores indicate tight adherence to the chosen reference specification, while higher scores indicate divergence, which can stem from either genuinely degraded inference (for example, heavy quantization) or benign specification differences such as alternative tokenizers or chat-template formats. In this non-adversarial, temperature-zero setting, Token-DiFR and cross-entropy behave similarly and are equally simple to compute without seed synchronization.
  • ...and 10 more figures
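
The activation fingerprints in Figure 4 can be sketched as follows. The listing does not say how the projection is built or how fingerprints are compared, so this example assumes a seeded Gram-Schmidt construction and a Euclidean distance. The coarse rounding is only a stand-in for quantization error, and all function names are hypothetical.

```python
import math
import random


def random_orthogonal_rows(k, d, seed):
    """k orthonormal rows in R^d via Gram-Schmidt on seeded Gaussian vectors.

    Requires k <= d. Assumed construction, not necessarily the paper's.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:  # remove the components along the earlier rows
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def fingerprint(activation, rows):
    """Compress a d-dimensional activation into k projected coordinates."""
    return [sum(a * b for a, b in zip(activation, r)) for r in rows]


def fingerprint_distance(fp_a, fp_b):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)))


d, k = 64, 8  # toy hidden size; k = 8 is one of the dimensions in Figure 4
rows = random_orthogonal_rows(k, d, seed=0)

rng = random.Random(1)
reference = [rng.gauss(0.0, 1.0) for _ in range(d)]

# Honest provider: same activation up to small benign numerical noise.
honest = [x + rng.gauss(0.0, 1e-3) for x in reference]
# Degraded provider: coarse rounding as a crude proxy for quantization.
degraded = [round(x / 0.25) * 0.25 for x in reference]

ref_fp = fingerprint(reference, rows)
honest_dist = fingerprint_distance(ref_fp, fingerprint(honest, rows))
degraded_dist = fingerprint_distance(ref_fp, fingerprint(degraded, rows))

print("honest distance:", honest_dist)
print("degraded distance:", degraded_dist)
```

Only the k projected values need to be sent per fingerprinted activation, which is where the communication savings come from. A larger k spends more payload in exchange for a sharper distance signal.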