Beyond Next Token Probabilities: Learnable, Fast Detection of Hallucinations and Data Contamination on LLM Output Distributions

Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, Haggai Maron

TL;DR

The paper tackles detecting hallucinations and training data contamination in LLMs under gray-box access by introducing the LLM Output Signature (LOS), which combines Token Distribution Sequences (TDS) and Actual Token Probabilities (ATP). It proposes LOS-Net, a lightweight attention-based model that encodes LOS via top-$K$ row-sorting and a rank-encoded ATP, then processes the sequence with a Transformer to produce fast, accurate detections. The authors prove LOS-Net can approximate broad classes of gated scoring functions, unifying prior gray-box methods, and demonstrate superior performance and very low latency across multiple datasets and models with strong cross-model and cross-dataset transfer. They also show robust runtime efficiency and meaningful transfer even with restricted API access. The work provides a practical, scalable framework for auditing LLM outputs and promoting safer deployment, with code available publicly.
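The encoding step summarized above (top-$K$ row-sorting of the TDS plus a rank encoding of the ATP) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `encode_los`, the argument names, and the choice to clip ranks at $K$ are assumptions made for illustration only.

```python
import numpy as np

def encode_los(tds: np.ndarray, token_ids: np.ndarray, k: int):
    """Hypothetical LOS encoding sketch.

    tds:       (T, V) array, one next-token distribution per position (the TDS).
    token_ids: (T,) array, index of the token actually present at each position.
    k:         number of largest probabilities to keep per row.
    """
    num_positions = tds.shape[0]
    # Sort every row in descending order and keep the K largest probabilities.
    top_k = -np.sort(-tds, axis=1)[:, :k]
    # Actual Token Probabilities (ATP): probability assigned to the observed token.
    atp = tds[np.arange(num_positions), token_ids]
    # Rank of the observed token: how many vocabulary entries are strictly more likely.
    ranks = (tds > atp[:, None]).sum(axis=1)
    # Assumption: tokens ranked outside the top K share a single bucket (index k).
    ranks = np.minimum(ranks, k)
    return top_k, atp, ranks

# Toy example: 2 positions, vocabulary of 3 tokens.
toy_tds = np.array([[0.1, 0.7, 0.2],
                    [0.5, 0.3, 0.2]])
toy_tokens = np.array([1, 2])
toy_top_k, toy_atp, toy_ranks = encode_los(toy_tds, toy_tokens, k=2)
```

In a full model, the per-position features would then presumably be embedded and passed to a Transformer encoder, as the summary describes; that part is omitted here.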

Abstract

The automated detection of hallucinations and training data contamination is pivotal to the safe deployment of Large Language Models (LLMs). These tasks are particularly challenging in settings where no access to model internals is available. Current approaches in this setup typically leverage only the probabilities of actual tokens in the text, relying on simple task-specific heuristics. Crucially, they overlook the information contained in the full sequence of next-token probability distributions. We propose to go beyond hand-crafted decision rules by learning directly from the complete observable output of LLMs -- consisting not only of next-token probabilities, but also the full sequence of next-token distributions. We refer to this as the LLM Output Signature (LOS), and treat it as a reference data type for detecting hallucinations and data contamination. To that end, we introduce LOS-Net, a lightweight attention-based architecture trained on an efficient encoding of the LOS, which can provably approximate a broad class of existing techniques for both tasks. Empirically, LOS-Net achieves superior performance across diverse benchmarks and LLMs, while maintaining extremely low detection latency. Furthermore, it demonstrates promising transfer capabilities across datasets and LLMs. Full code is available at https://github.com/BarSGuy/Beyond-next-token-probabilities.

Paper Structure

This paper contains 31 sections, 8 theorems, 40 equations, 14 figures, 7 tables.

Key Result

Proposition 1

Let $\mathcal{B}$ be the set of scoring functions implemented by the Min/Max/Mean aggregated probability methods (guerreiro2022looking; kadavath2022language; varshney2023stitch; huang2023look) for hallucination detection (HD), as well as the Loss (yeom2018privacy), MinK% (shi2023detecting), and MinK%++ (zhang2024min) methods for data contamination detection (DCD). For a
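To make some of the baselines named in Proposition 1 concrete, the following is a minimal sketch of three of them as functions of the ATP alone. The function names are hypothetical, and the rounding rule for the Min-K% token count (ceiling, at least one token) is an assumption rather than a detail taken from the paper. MinK%++ is omitted.

```python
import numpy as np

def aggregated_probability(atp: np.ndarray, how: str = "mean") -> float:
    """Min/Max/Mean aggregation of the actual token probabilities."""
    reducers = {"min": np.min, "max": np.max, "mean": np.mean}
    return float(reducers[how](atp))

def loss_score(atp: np.ndarray) -> float:
    """Average negative log-likelihood of the observed tokens."""
    return float(-np.mean(np.log(atp)))

def min_k_percent(atp: np.ndarray, k_percent: float = 20.0) -> float:
    """Mean log-probability over the k% least likely observed tokens."""
    log_probs = np.sort(np.log(atp))  # ascending: least likely tokens first
    # Assumption: round the token count up and always keep at least one token.
    count = max(1, int(np.ceil(len(log_probs) * k_percent / 100.0)))
    return float(np.mean(log_probs[:count]))

# Toy ATP sequence for a four-token text.
toy_atp = np.array([0.5, 0.25, 1.0, 0.125])
toy_min = aggregated_probability(toy_atp, "min")
toy_loss = loss_score(toy_atp)
toy_min_k = min_k_percent(toy_atp, k_percent=50.0)
```

Each score is a fixed rule over the ATP sequence and ignores the rest of each next-token distribution, which is the limitation the abstract highlights.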

Figures (14)

  • Figure 1: Left: The LLM processes the input "What does the cat chase?" and generates the output "A big mouse". Right: The corresponding query/response Token Distribution Sequences (TDS) and Actual Token Probabilities (ATP), together constituting the LLM Output Signature (LOS). We propose to detect instances of hallucinations and data contamination by learning directly over this unified data representation, beyond task specific heuristics operating on partial information thereof.
  • Figure 2: Transfer Test AUC to varying datasets (top, Mis-7b) and LLMs (bottom, IMDB fixed).
  • Figure 3: BookMIA zero-shot AUC. Bold and $^*$ mark results that outperform the reference-free and reference-based baselines, respectively.
  • Figure 4: Cross-LLM transfer Test AUCs (cols: source LLMs, rows: target LLMs). Bold: finetuning LOS-Net outperforms baselines, $^*$: it outperforms the same LOS-Net trained from scratch.
  • Figure 5: Cross-dataset transfer Test AUCs (cols: source data, rows: target data). Bold: finetuning LOS-Net outperforms baselines, $^*$: it outperforms the same LOS-Net trained from scratch.
  • ...and 9 more figures

Theorems & Definitions (13)

  • Proposition 1: GSFs capture known baselines
  • Proposition 2: LOS-Net can approximate GSFs
  • Corollary 1: Approximation of Baselines by LOS-Net
  • Proposition 2: LOS-Net can approximate GSFs
  • proof
  • Lemma 1
  • proof
  • Theorem 1
  • Proposition 2: GSFs capture known baselines
  • proof
  • ...and 3 more