One SPACE to Rule Them All: Jointly Mitigating Factuality and Faithfulness Hallucinations in LLMs

Pengbo Wang, Chaozhuo Li, Chenxu Wang, Liwen Zheng, Litian Zhang, Xi Zhang

TL;DR

SPACE identifies and edits a shared activation subspace in LLMs that jointly governs factuality and faithfulness, addressing a key limitation of prior single-task approaches. It combines empirical activation analysis, a convex-theoretic guarantee that the two tasks' subspaces overlap, and a four-stage workflow (activation profiling, contrastive probing, semantic cluster fusion, and dynamic space editing) that edits this common space through targeted attention-head interventions. The approach uses a suite of losses and clustering techniques, including $\mathcal{L}_{ctr}$, $\mathcal{L}_{orth}$, and hard-ctr with a learned direction $\mathbf{d}$, culminating in a decoding-time update rule that adds $\sum_{h} s_l^h \theta_l^h$ to the output of layer $l$. On the TruthfulQA and PDTB benchmarks, SPACE yields consistent improvements in both factuality and faithfulness and generalizes across architectures, suggesting practical value for more reliable real-world deployments of LLMs.
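The decoding-time update mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each head's edit vector $\theta_l^h$ already lives in the same space as the layer output, that $s_l^h$ is a scalar weight per head (zero for heads that are not edited), and that the edit is purely additive. The function name `edit_layer_output` and all argument names are hypothetical.

```python
from typing import List, Sequence


def edit_layer_output(
    x_l: Sequence[float],
    theta_l: Sequence[Sequence[float]],
    s_l: Sequence[float],
) -> List[float]:
    """Return x_l + sum_h s_l[h] * theta_l[h] for one layer and one token.

    x_l:     layer output vector (length d_model; assumed shape)
    theta_l: one edit vector per head, each of length d_model (assumption)
    s_l:     one scalar weight per head; 0 means the head is left alone
    """
    if len(theta_l) != len(s_l):
        raise ValueError("need exactly one weight per head")

    d_model = len(x_l)
    edited = [float(v) for v in x_l]

    for weight, theta in zip(s_l, theta_l):
        if len(theta) != d_model:
            raise ValueError("edit vector length must match the layer output")
        if weight == 0:
            continue  # unselected head contributes nothing
        for i, component in enumerate(theta):
            edited[i] += weight * component

    return edited


if __name__ == "__main__":
    # Two heads, a three-dimensional toy layer output.
    x = [0.0, 0.0, 0.0]
    theta = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
    s = [0.5, 1.0]
    print(edit_layer_output(x, theta, s))
```

In a real model, the same addition would typically be applied inside a hook on the relevant layer rather than on plain lists; that wiring depends on the framework and is left out here.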

Abstract

LLMs have demonstrated unprecedented capabilities in natural language processing, yet their practical deployment remains hindered by persistent factuality and faithfulness hallucinations. While existing methods address these hallucination types independently, they inadvertently induce performance trade-offs, as interventions targeting one type often exacerbate the other. Through empirical and theoretical analysis of activation space dynamics in LLMs, we reveal that these hallucination categories share overlapping subspaces within neural representations, presenting an opportunity for concurrent mitigation. To harness this insight, we propose SPACE, a unified framework that jointly enhances factuality and faithfulness by editing shared activation subspaces. SPACE establishes a geometric foundation for shared subspace existence through dual-task feature modeling, then identifies and edits these subspaces via a hybrid probe strategy combining spectral clustering and attention head saliency scoring. Experimental results across multiple benchmark datasets demonstrate the superiority of our approach.

Paper Structure

This paper contains 20 sections, 23 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The illustrations of two hallucination types and the performance trade-off.
  • Figure 2: Activation Distributions of Factual and Faithful Tasks. Subfigure (a) shows the proportion of disjoint and interwoven activation patterns after training. Subfigures (b) and (c) show one case each of a disjoint and an interwoven activation pattern, respectively.
  • Figure 3: Framework of the proposed SPACE model.
  • Figure 4: Ablation study of SPACE on LLaMA-2-7B-Chat: dark green bars indicate original performance, light green bars indicate performance after removing key components; "Fact." refers to True*Info % (TruthfulQA), "Faith." refers to DISQ Overall % (PDTB).
  • Figure 5: Hyperparameter sensitivity analysis for $\alpha$ and $k$.
  • ...and 1 more figures