Fact Grounded Attention: Eliminating Hallucination in Large Language Models Through Attention Level Knowledge Integration

Aayush Gupta

TL;DR

This work tackles the pervasive hallucination problem in large language models by introducing Fact Grounded Attention (FGA), an architectural modification that injects verifiable external facts directly into the transformer’s attention mechanism. FGA combines an external knowledge base (KB) with an attention-grounding matrix, a learnable gate that routes factual grounding, and hard vocabulary constraints that guarantee deterministic correctness when KB coverage is complete. Empirically, FGA raises factual accuracy on 1,107 technical queries from a 6.3% baseline to 99.7% with fine-tuning, and it supports knowledge updates in under a second at a small computational overhead (~3%). The approach yields strong gains on public benchmarks, offers traceability and domain adaptability, and lays a foundation for deterministic, knowledge-driven neural generation in knowledge-intensive domains. The central formulas governing grounding in the attention scores are $S_{FGA} = S + \alpha \odot G$ and $B_{qf} = \frac{QK_{fact}^T}{\sqrt{d_k}}$.

Abstract

"The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge." Large Language Models have conquered natural language but remain prisoners of their own probabilistic nature--confidently hallucinating facts they never truly knew. We present Fact Grounded Attention (FGA), a novel architectural modification that transforms unreliable language models into deterministic truth tellers by injecting verifiable knowledge directly into the attention mechanism. Unlike existing approaches that patch hallucinations after generation or prepend retrieved text, FGA intervenes at the mathematical heart of the transformer--the pre-softmax attention scores--creating a model that cannot hallucinate when facts exist in its knowledge base. Our experiments across 1,107 technical queries spanning smartphones, laptops, and electric vehicles demonstrate a transformation from 6.3% accuracy in vanilla Llama 3.2 to 99.7% accuracy with FGA. More critically, knowledge updates occur in under one second without retraining, compared to hours for parameter editing approaches. FGA doesn't just reduce hallucination--it eliminates it entirely for verifiable facts, marking a fundamental shift from probabilistic approximation to deterministic precision in neural language generation.

Paper Structure

This paper contains 62 sections, 4 theorems, 11 equations, 4 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Let $s_i$ be the original attention score for token $i$ and $g_i$ be its grounding score. The probability ratio between grounded and ungrounded tokens is:
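
As a hedged sketch of where such a ratio comes from, assume the grounded score of token $i$ enters the softmax as $s_i + \alpha g_i$, which is consistent with $S_{FGA} = S + \alpha \odot G$. The second index $j$ and the equal-score simplification below are assumptions made for illustration, not the paper's exact statement. For two tokens in the same attention row the softmax normalizer cancels:

$$
\frac{P_i}{P_j} = \frac{\exp(s_i + \alpha g_i)}{\exp(s_j + \alpha g_j)} = \exp\big((s_i - s_j) + \alpha (g_i - g_j)\big)
$$

If the two tokens have equal original scores ($s_i = s_j$) and only token $i$ is grounded ($g_i = g$, $g_j = 0$), this reduces to $e^{\alpha g}$, the quantity plotted in Figure 3. At $\alpha = 0.8$ and $g = 5$ it gives $e^{4} \approx 54.6$.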

Figures (4)

  • Figure 1: FGA Architecture: The standard attention scores $S$ are augmented with fact grounded scores $G$ from the knowledge base, modulated by a learnable gate $\alpha$ that determines when factual grounding is needed.
  • Figure 2: Dimensional analysis of FGA grounding computation. The query fact bias $B_{qf} \in \mathbb{R}^{L \times M}$ is multiplied with entity assignment matrix $A \in \mathbb{R}^{M \times L}$ to produce grounding scores $G \in \mathbb{R}^{L \times L}$, matching the dimensions of attention scores $S$ for direct addition.
  • Figure 3: Exponential amplification of token probabilities through FGA grounding. The probability ratio $e^{\alpha g}$ shows how grounded tokens become exponentially more likely as the grounding score $g$ and gate value $\alpha$ increase. At a typical operating point ($\alpha=0.8, g=5$), tokens receive a roughly 55× probability boost ($e^{4} \approx 54.6$).
  • Figure 4: Gate activation distribution for factual vs. non-factual contexts
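
The shapes described in Figures 1 to 3 can be checked with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function name `fga_scores`, the random inputs, the one-hot entity assignment, and the per-query gate of shape `(L, 1)` are all hypothetical choices made here. Only the formulas $B_{qf} = QK_{fact}^T/\sqrt{d_k}$, $G = B_{qf}A$, and $S_{FGA} = S + \alpha \odot G$ come from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fga_scores(Q, K, K_fact, A, alpha):
    """Return (S_FGA, G) for one attention head.

    Q, K:    (L, d_k) query and key matrices
    K_fact:  (M, d_k) fact keys from the knowledge base
    A:       (M, L)   entity assignment matrix
    alpha:   gate, broadcastable to (L, L) (shape is an assumption)
    """
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)          # standard attention scores, (L, L)
    B_qf = Q @ K_fact.T / np.sqrt(d_k)  # query-fact bias, (L, M)
    G = B_qf @ A                        # grounding scores, (L, L)
    return S + alpha * G, G             # pre-softmax grounded scores

# Hypothetical sizes and random inputs, for shape checking only.
rng = np.random.default_rng(0)
L, M, d_k = 6, 3, 8
Q = rng.normal(size=(L, d_k))
K = rng.normal(size=(L, d_k))
K_fact = rng.normal(size=(M, d_k))

# Assumed one-hot assignment: fact 0 -> token 2, fact 1 -> token 4.
A = np.zeros((M, L))
A[0, 2] = 1.0
A[1, 4] = 1.0

alpha = np.full((L, 1), 0.8)  # assumed per-query gate value

S_fga, G = fga_scores(Q, K, K_fact, A, alpha)
weights = softmax(S_fga, axis=-1)  # each row sums to 1

# Amplification at the operating point quoted for Figure 3.
boost = np.exp(0.8 * 5.0)
print(S_fga.shape, round(float(boost), 1))
```

Columns of `A` that contain no assigned entity produce zero columns in `G`, so tokens without a knowledge-base match keep their original attention scores in this sketch.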

Theorems & Definitions (6)

  • Theorem 1: Grounding Amplification
  • proof
  • Theorem 2: Grounding Convergence
  • proof: Proof Sketch
  • Theorem 3: Knowledge Capacity
  • Theorem 4: Update Efficiency