Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models

Shuo Zhang, Fabrizio Gotti, Fengran Mo, Jian-Yun Nie

TL;DR

This paper investigates whether lexical training-data coverage, quantified via $n$-gram frequencies in a large pretraining corpus, can aid hallucination detection in LLMs. It constructs scalable suffix-array indices over RedPajama's 1.3T-token corpus to extract prompt and generation $n$-gram statistics and combines these lexical features with intrinsic log-probability signals. Across TriviaQA, CoQA, and NQ-Open using RedPajama-INCITE 3B and 7B, results show that $n$-gram occurrence features alone are only weak predictors, but provide consistent gains when fused with generation and prompt log-probabilities, especially under uncertain model conditions. The study also reveals substantial data sparsity in common $n$-grams, suggesting that LLMs frequently generalize beyond memorized sequences, which has implications for designing robust hallucination detectors and interpreting model behavior in open-domain QA.

Abstract

Hallucination in large language models (LLMs) is a fundamental challenge, particularly in open-domain question answering. Prior work attempts to detect hallucination with model-internal signals such as token-level entropy or generation consistency, but the connection between pretraining data exposure and hallucination remains underexplored. Existing studies show that LLMs underperform on long-tail knowledge, i.e., the accuracy of the generated answer drops when the ground-truth entities are rare in the pretraining data. However, whether data coverage itself can serve as a detection signal has been overlooked. We pose a complementary question: does lexical training-data coverage of the question and/or the generated answer provide additional signal for hallucination detection? To investigate this, we construct scalable suffix arrays over RedPajama's 1.3-trillion-token pretraining corpus to retrieve $n$-gram statistics for both prompts and model generations. We evaluate their effectiveness for hallucination detection across three QA benchmarks. Our observations show that while occurrence-based features are weak predictors when used alone, they yield modest gains when combined with log-probabilities, particularly on datasets with higher intrinsic model uncertainty. These findings suggest that lexical coverage features provide a complementary signal for hallucination detection. All code and suffix-array infrastructure are provided at https://github.com/WWWonderer/ostd.
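
To make the idea of retrieving $n$-gram statistics from a suffix array concrete, the following is a minimal toy sketch, not the authors' implementation. It builds a naive suffix array over a tokenized corpus and counts $n$-gram occurrences by binary search. The function names (`build_suffix_array`, `count_ngram`, `avg_log_count`) and the averaged log-count feature are assumptions made for illustration; the exact feature definitions used in the paper are not given in this section, and a 1.3T-token corpus would require a far more scalable construction than the one shown here.

```python
import math


def build_suffix_array(tokens):
    """Return suffix start positions sorted by the suffix they begin.

    Naive construction for illustration only; it is quadratic or worse
    and would not scale to a real pretraining corpus.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, ngram, upper):
    """Binary search over the suffix array.

    Returns the first index whose length-n prefix is >= ngram (upper=False)
    or > ngram (upper=True). Truncated prefixes stay sorted because the
    full suffixes are sorted lexicographically.
    """
    n = len(ngram)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_ngram(tokens, sa, ngram):
    """Count how often `ngram` (a list of tokens) occurs in `tokens`."""
    ngram = list(ngram)
    return _bound(tokens, sa, ngram, True) - _bound(tokens, sa, ngram, False)


def avg_log_count(tokens, sa, text, n):
    """Hypothetical coverage feature: mean of log(1 + count) over the
    n-grams of `text`. Returns 0.0 when `text` has fewer than n tokens."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    return sum(math.log1p(count_ngram(tokens, sa, g)) for g in grams) / len(grams)


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat ran".split()
    sa = build_suffix_array(corpus)
    print(count_ngram(corpus, sa, ["the", "cat"]))
    print(avg_log_count(corpus, sa, ["the", "cat", "flew"], 2))
```

A feature of this kind can be computed separately for the prompt and for the generated answer and then passed, together with log-probabilities, to a downstream classifier. Each lookup costs a number of comparisons logarithmic in the corpus size, which is what makes a suffix array attractive for this sort of query.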

Paper Structure

This paper contains 35 sections, 4 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: AUROC curves comparing log-probabilities and occurrence-based features across datasets with RedPajama-INCITE-7B model and EM evaluation.
  • Figure 2: Decision trees (depth 3) on NQ-Open. The full-feature model (left) splits early on generation 2-gram score (gen_occ_2), isolating a hallucination-prone cluster. The log-only tree (right) lacks this separation.
  • Figure 3: Distribution of prompt 3-gram scores for hallucinated and non-hallucinated answers on TriviaQA.
  • Figure 4: AUROC curves comparing log-probabilities and occurrence-based features across datasets. Model: RedPajama-INCITE-7B; metric: EM.
  • Figure 5: AUROC curves comparing log-probabilities and occurrence-based features across datasets. Model: RedPajama-INCITE-7B; metric: ROUGE-L.
  • ...and 2 more figures