Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training

Shahrad Mohammadzadeh, Juan David Guerra, Marco Bonizzato, Reihaneh Rabbany, Golnoosh Farnadi

TL;DR

This work addresses hallucination risk during LLM training by identifying persistent oscillations in hallucination metrics across checkpoints and model scales. It introduces Sensitivity Dropout (SenD), which deterministically drops Sensitive Embedding Indices (SEIs) identified from penultimate-layer variability, and Efficient EigenScore (EES), a scalable surrogate for EigenScore based on Density of States and Chebyshev polynomials that enables fast, unsupervised hallucination detection. Empirically, SenD reduces training-time hallucination variance and achieves up to 17% improvements in test-time reliability and better factual accuracy across Wikipedia, Medical, Legal, and Coding domains without degrading downstream performance; EES closely tracks EigenScore while cutting computation time roughly in half at large scales. The method demonstrates that training-dynamics-aware mitigation can complement post-hoc approaches like RAG, with practical potential for integration into large-model pipelines, though validation is still limited to continual training due to compute constraints. The work provides a scalable framework for stabilizing LLM training and offers new insights into how internal dynamics relate to hallucination risk, paving the way for broader adoption and scaling to larger pretraining regimes.

Abstract

As large language models (LLMs) become increasingly prevalent, concerns about their reliability, particularly due to hallucinations (factually inaccurate or irrelevant outputs), have grown. Our research investigates the relationship between the uncertainty in training dynamics and the emergence of hallucinations. Using models from the Pythia suite and several hallucination detection metrics, we analyze hallucination trends and identify significant variance during training. To address this, we propose Sensitivity Dropout (SenD), a novel training protocol designed to reduce hallucination variance during training by deterministically dropping embedding indices with significant variability. In addition, we develop an unsupervised hallucination detection metric, Efficient EigenScore (EES), which approximates the traditional EigenScore at twice the speed. This metric is integrated into our training protocol, allowing SenD to be both computationally scalable and effective at reducing hallucination variance. SenD improves test-time reliability of Pythia and Meta's Llama models by up to 17% and enhances factual accuracy in Wikipedia, Medical, Legal, and Coding domains without affecting downstream task performance.
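
To make "deterministically dropping embedding indices with significant variability" concrete, here is a minimal NumPy sketch. It rests on assumptions not stated in the abstract: variability is taken to be the summed absolute change of each embedding index across consecutive checkpoints, and a fixed fraction of the most variable indices is flagged. The function names and the `k_fraction` parameter are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def sensitive_embedding_indices(checkpoint_embeddings, k_fraction=0.2):
    """Return the indices whose values vary most across consecutive checkpoints.

    checkpoint_embeddings: array of shape (T, d), one sentence embedding per
    checkpoint for the same input (T >= 2).
    k_fraction: hypothetical fraction of indices to flag as sensitive.
    """
    emb = np.asarray(checkpoint_embeddings, dtype=float)
    # Absolute change of every index between consecutive checkpoints.
    net_change = np.abs(np.diff(emb, axis=0))   # shape (T - 1, d)
    variability = net_change.sum(axis=0)        # shape (d,)
    k = max(1, int(round(k_fraction * emb.shape[1])))
    # Take the k most variable indices and return them in ascending order.
    top = np.argsort(variability)[-k:]
    return np.sort(top)

def drop_indices(embedding, indices):
    """Zero the given indices; the mask is fixed, not sampled at random."""
    out = np.array(embedding, dtype=float, copy=True)
    out[..., indices] = 0.0
    return out

# Toy data: ten indices, two of which oscillate across five checkpoints.
rng = np.random.default_rng(0)
T, d = 5, 10
emb = np.tile(rng.normal(size=d), (T, 1))    # stable baseline
emb[:, 3] += np.array([0, 5, -5, 5, -5])     # index 3 oscillates strongly
emb[:, 7] += np.array([0, 2, -2, 2, -2])     # index 7 oscillates moderately

seis = sensitive_embedding_indices(emb, k_fraction=0.2)
dropped = drop_indices(emb[-1], seis)
```

Unlike standard dropout, which samples a fresh random mask on every step, the mask here is a deterministic function of the observed variability, which is the property the abstract emphasizes.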

Paper Structure

This paper contains 32 sections, 2 theorems, 24 equations, 13 figures, 4 tables, 2 algorithms.

Key Result

Lemma 1

Let $f = \log$. Then, for a covariance matrix $H$ with eigenvalues $\lambda_i$, we have where $\lambda_i$ are the eigenvalues of $H$.
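
For context, a standard fact in this setting is that a symmetric positive definite matrix $H$ satisfies $\log\det(H) = \sum_i \log \lambda_i = \operatorname{tr}(\log H)$, so a polynomial approximation of $\log$ turns a log-determinant into a sum of matrix-polynomial traces. The sketch below illustrates that idea with a Chebyshev series. It is an illustration under assumptions, not the lemma's statement or the paper's EES algorithm: the spectral bounds `lam_min` and `lam_max` are assumed known, the function name is hypothetical, and the traces are computed exactly with dense products (scalable variants typically estimate them instead).

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def logdet_chebyshev(H, lam_min, lam_max, degree=20):
    """Approximate log det(H) = sum_i log(lambda_i) with a Chebyshev series.

    Assumes H is symmetric with all eigenvalues inside [lam_min, lam_max]
    and lam_min > 0. All names here are illustrative.
    """
    n = H.shape[0]
    half_width = 0.5 * (lam_max - lam_min)
    center = 0.5 * (lam_max + lam_min)
    # Coefficients of t -> log(half_width * t + center) on [-1, 1].
    coeffs = C.chebinterpolate(lambda t: np.log(half_width * t + center), degree)
    # Shift and scale H so that its spectrum lies in [-1, 1].
    Ht = (H - center * np.eye(n)) / half_width
    # Recurrence T_{m+1} = 2 * Ht @ T_m - T_{m-1}, accumulating c_m * tr(T_m).
    T_prev, T_curr = np.eye(n), Ht
    total = coeffs[0] * np.trace(T_prev) + coeffs[1] * np.trace(T_curr)
    for c in coeffs[2:]:
        T_prev, T_curr = T_curr, 2.0 * Ht @ T_curr - T_prev
        total += c * np.trace(T_curr)
    return total

# Build a symmetric matrix with known eigenvalues in [0.5, 2.0].
rng = np.random.default_rng(0)
n = 8
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
eigs = np.linspace(0.5, 2.0, n)
H = Q @ np.diag(eigs) @ Q.T

exact = np.sum(np.log(eigs))                 # sum of log-eigenvalues
_, via_slogdet = np.linalg.slogdet(H)        # log-determinant from LAPACK
approx = logdet_chebyshev(H, lam_min=0.4, lam_max=2.1)
```

The accuracy of the series depends on how far the interval `[lam_min, lam_max]` stays from zero, where the logarithm is singular; a small regularization term added to the covariance keeps the lower bound positive.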

Figures (13)

  • Figure 1: Visualization of Oscillatory Behavior. (a) SelfCheckGPT and (b) HaluEval EM metrics across model sizes from 70M to 12B. Solid lines show average performance; shaded regions indicate standard deviation. High variance in hallucination metrics highlights the need for stabilization. For Perplexity (PPL), Rouge1, and other HaluEval metrics, refer to the appendix on oscillations.
  • Figure 2: Comparison of sensitive embedding index (SEI) dropout with random embedding index dropout at inference on EleutherAI's Pythia models. The y-axis, Drop Value, denotes the decrease in EigenScore (i.e., fewer confabulations) following the dropout method used. (a) SEI dropout results are consistent across model sizes. (b) Hallucinatory outputs show a larger EigenScore drop than correct ones. SEI dropout significantly reduces EigenScore compared to random dropout in both (a) and (b).
  • Figure 3: Efficient EigenScore approximation scaling. Computation time comparison between EigenScore and EES (moments = 20). The x-axis represents matrix size (rows × columns), and the y-axis shows computation time. As matrix size increases, EES consistently reduces computation time, making it a practical choice.
  • Figure 4: Regular Training vs. SenD on HELM and LegalBench datasets. The first row shows Llama 3.1 8B and the second row shows Pythia 1B. Column one, (a) and (e), is trained on the HELM dataset; column two, (b) and (f), on LegalBench; column three, (c) and (g), on MedHalt; and column four, (d) and (h), on CodeSearchNet. In all cases, training with SenD demonstrates a more controlled reduction in EES, optimizing for hallucination mitigation and loss stability. For results on Llama 3.2 1B training, refer to the appendix with additional SenD experiments.
  • Figure 5: Net change of sentence embeddings between checkpoints 125,000 and 143,000. Each colour corresponds to a different input text. Specific embedding indices change drastically between the two training checkpoints regardless of the input.
  • ...and 8 more figures

Theorems & Definitions (6)

  • Definition 3.1: Sentence Embedding Vector
  • Definition 3.2: Net Change Formula
  • Definition 3.3: Sensitive Embedding Indices - SEIs
  • Definition 3.4: EigenScore
  • Lemma 1
  • Proposition 1
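
As a rough illustration of the EigenScore named in Definition 3.4, the sketch below follows the commonly cited form: the log-determinant of a regularized covariance over the sentence embeddings of K sampled generations, divided by K. The regularization value `alpha`, the centering convention, and the function name are assumptions here, and the paper's exact definition may differ in normalization.

```python
import numpy as np

def eigenscore(embeddings, alpha=1e-3):
    """EigenScore-style diversity measure over K sentence embeddings.

    embeddings: array of shape (K, d), one embedding per sampled generation.
    alpha: assumed regularization that keeps the covariance full rank.
    """
    Z = np.asarray(embeddings, dtype=float)
    K = Z.shape[0]
    # Center each embedding across its d dimensions (an assumed convention).
    Zc = Z - Z.mean(axis=1, keepdims=True)
    cov = Zc @ Zc.T                               # K x K Gram/covariance matrix
    _, logdet = np.linalg.slogdet(cov + alpha * np.eye(K))
    return logdet / K

rng = np.random.default_rng(0)
K, d = 5, 32
diverse = rng.normal(size=(K, d))                 # generations that disagree
identical = np.tile(rng.normal(size=d), (K, 1))   # generations that agree

score_diverse = eigenscore(diverse)
score_identical = eigenscore(identical)
```

Under this form, semantically divergent generations spread the covariance spectrum and raise the score, while near-identical generations collapse it toward the regularization floor, which is why a higher value is read as a sign of hallucination risk.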