Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

Eric Yocam, Varghese Vaidyan, Gurcan Comert, Paris Kalathas, Yong Wang, Judith L. Mwakalonge

TL;DR

This paper proposes Adaptive Activation Cancellation (AAC), a real-time, inference-time framework that treats hallucination-associated neural activations as structured interference within the transformer residual stream, drawing an explicit analogy to classical adaptive noise cancellation from signal processing.

Abstract

Large Language Models frequently generate fluent but factually incorrect text. We propose Adaptive Activation Cancellation (AAC), a real-time inference-time framework that treats hallucination-associated neural activations as structured interference within the transformer residual stream, drawing an explicit analogy to classical adaptive noise cancellation from signal processing. The framework identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing and suppresses them using a confidence-weighted forward hook during auto-regressive generation -- requiring no external knowledge, no fine-tuning, and no additional inference passes. Evaluated across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval, the real-time hook is the only intervention that consistently improves downstream accuracy on all three scales. Critically, the method is strictly surgical: WikiText-103 perplexity and MMLU reasoning accuracy are preserved at exactly 0.0% degradation across all three model scales, a property that distinguishes AAC from interventions that trade fluency or general capability for factual improvement. On the LLaMA 3-8B scale, the hook additionally yields positive generation-level gains (MC1 +0.04; MC2 +0.003; Token-F1 +0.003) while achieving probe-space selectivity 5.94x - 3.5x higher than the ITI baseline -- demonstrating that targeted neuron-level suppression can simultaneously improve factual accuracy and preserve model capability.
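
The two steps named in the abstract (a linear probe that scores hallucination confidence, then a confidence-weighted suppression of the H-Nodes) can be illustrated with a minimal sketch. This is not the authors' implementation: the sigmoid probe, the scaling rule `1 - strength * confidence`, and the names `hallucination_confidence`, `cancel_h_nodes`, and `strength` are all assumptions chosen for illustration, and plain Python lists stand in for residual-stream tensors.

```python
import math


def hallucination_confidence(activations, probe_weights, probe_bias):
    """Score one activation vector with a linear probe: sigmoid(w . h + b).

    ASSUMPTION: the probe is a logistic-regression-style classifier; the
    excerpt only says "layer-wise linear probing".
    """
    logit = sum(w * h for w, h in zip(probe_weights, activations)) + probe_bias
    return 1.0 / (1.0 + math.exp(-logit))


def cancel_h_nodes(activations, h_node_indices, confidence, strength=1.0):
    """Attenuate only the H-Node coordinates, in proportion to confidence.

    ASSUMPTION: the attenuation factor is (1 - strength * confidence).
    Coordinates outside h_node_indices are returned unchanged.
    """
    scale = 1.0 - strength * confidence
    out = list(activations)
    for i in h_node_indices:
        out[i] = out[i] * scale
    return out


# Hypothetical toy values: a 4-dimensional "residual stream" with two H-Nodes.
activations = [0.5, 2.0, -1.0, 3.0]
probe_weights = [0.0, 1.0, 0.0, 1.0]
probe_bias = -2.0
h_nodes = [1, 3]

conf = hallucination_confidence(activations, probe_weights, probe_bias)
cancelled = cancel_h_nodes(activations, h_nodes, conf)
```

In a PyTorch model, logic of this kind would typically run inside a callback registered with `torch.nn.Module.register_forward_hook` on the chosen layer, so that it applies at every generation step without a second forward pass. When the probe confidence is near zero, the scale factor is near one and the activations pass through almost untouched, which is one plausible way to keep grounded outputs from drifting.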

Paper Structure (38 sections, 9 equations, 8 figures, 23 tables, 2 algorithms)

Figures (8)

  • Figure 1: Cancellation selectivity by method and model. Values above the dashed line (Sel $=$ 1) indicate net benefit. Post-hoc selectivity is non-monotonic across scale: Phi-3-mini is lowest ($1.72\times$) while LLaMA 3-8B recovers to $5.58\times$.
  • Figure 2: Selectivity, reduction, and drift vs. percentile threshold for OPT-125M. Selectivity rises super-linearly above the 90th percentile as drift approaches zero.
  • Figure 3: Static vs. adaptive ANC: hallucination confidence, grounded confidence, and selectivity across all three model scales. Adaptive confidence weighting reduces grounded drift by 25.9--40.1%.
  • Figure 4: Probe selectivity (left) and MC1 generation delta (right) for ITI, DoLA, and H-Node ANC across all three model scales. H-Node ANC leads in selectivity at OPT and LLaMA scale; DoLA leads in MC1 at LLaMA scale.
  • Figure 5: Top-5 H-Node activation gaps per model at the best probe layer. Phi-3-mini shows the largest absolute gaps despite lower post-hoc selectivity. The cross-model attractor ($\dagger$) is the Angelina Jolie celebrity-fact prompt.
  • ...and 3 more figures
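
The captions above describe selectivity in terms of reduction and drift, with Sel = 1 as the break-even line. A minimal sketch of one reading consistent with those captions follows. The formula is an assumption, not a definition taken from the paper: selectivity is computed here as the drop in probe confidence on hallucinated inputs divided by the absolute change in probe confidence on grounded inputs. The function name, argument names, and `eps` are hypothetical.

```python
def selectivity(halluc_before, halluc_after, grounded_before, grounded_after, eps=1e-8):
    """ASSUMED definition: reduction on hallucinated inputs / drift on grounded inputs.

    reduction: how much the hallucination confidence fell after the intervention.
    drift: how much the grounded confidence moved (in either direction).
    eps guards against division by zero when drift vanishes.
    """
    reduction = halluc_before - halluc_after
    drift = abs(grounded_before - grounded_after)
    return reduction / (drift + eps)


# Hypothetical probe confidences before and after an intervention.
sel = selectivity(halluc_before=0.8, halluc_after=0.5,
                  grounded_before=0.7, grounded_after=0.65)
net_benefit = sel > 1.0  # above the Sel = 1 line under this assumed definition
```

Under this reading, a ratio above 1 means the intervention moved hallucinated confidence more than it disturbed grounded confidence, and the ratio grows quickly as drift approaches zero, which matches the behaviour described for Figure 2.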