TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

Shaolei Zhang; Tian Yu; Yang Feng

TL;DR

TruthX reduces LLM hallucinations through inference-time editing. An auto-encoder maps the model's internal representations into decoupled truthful and semantic latent spaces, and a contrastive objective identifies a truthful editing direction within the truthful space. At inference, representations in a selected subset of layers are edited along this direction to improve truthfulness without sacrificing generative ability. On TruthfulQA and additional benchmarks, TruthX improves the truthfulness of 13 LLMs by about 20% on average while keeping informativeness competitive. Analyses support the method's robustness, examine how the choice of editing layers affects results, show that it generalizes across homologous model families, and indicate that middle layers are the most informative for controlling truthfulness.

Abstract

Large Language Models (LLMs) sometimes produce hallucinations; in particular, they may generate untruthful responses even when they possess the correct knowledge. Activating truthfulness within an LLM is key to fully unlocking its knowledge potential. In this paper, we propose TruthX, an inference-time intervention method that activates the truthfulness of an LLM by identifying and editing the features within its internal representations that govern truthfulness. TruthX employs an auto-encoder to map the LLM's representations into semantic and truthful latent spaces, respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, by editing the LLM's internal representations in the truthful space, TruthX effectively enhances the LLM's truthfulness. Experiments show that TruthX improves the truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark. Further analyses suggest that TruthX can control an LLM to produce truthful or hallucinatory responses by editing only one vector in its internal representations.
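
As a rough illustration of editing a representation in a latent "truthful" space, the following toy sketch may help. It is not the authors' implementation. It assumes a fixed linear encoder and decoder (the matrices `enc_truth` and `dec_truth` are hypothetical, not learned), takes the editing direction to be the difference between the mean truthful and mean untruthful latents, and applies the edit as the decoded difference between a forward and a backward shift along that direction. All function names, matrices, and the `strength` parameter are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def editing_direction(truthful_latents, untruthful_latents):
    """Assumed direction: mean truthful latent minus mean untruthful latent."""
    mean_truthful = mean_vector(truthful_latents)
    mean_untruthful = mean_vector(untruthful_latents)
    return [t - u for t, u in zip(mean_truthful, mean_untruthful)]


def edit_representation(x, enc_truth, dec_truth, delta, strength):
    """Shift x by the decoded effect of moving along delta in the latent space.

    A positive strength pushes toward the 'truthful' side of the toy space,
    a negative strength pushes toward the opposite side.
    """
    h = matvec(enc_truth, x)  # hypothetical truthful latent of x
    plus = matvec(dec_truth, [a + d for a, d in zip(h, delta)])
    minus = matvec(dec_truth, [a - d for a, d in zip(h, delta)])
    return [xi + strength * (p - m) for xi, p, m in zip(x, plus, minus)]


# Toy setup: the "truthful" latent simply reads and writes the third coordinate.
enc_truth = [[0.0, 0.0, 1.0]]
dec_truth = [[0.0], [0.0], [1.0]]

delta = editing_direction([[1.0], [3.0]], [[-1.0], [-3.0]])
x = [0.5, -0.2, 0.0]
toward_truthful = edit_representation(x, enc_truth, dec_truth, delta, 0.5)
toward_untruthful = edit_representation(x, enc_truth, dec_truth, delta, -0.5)
```

In this toy setup only the coordinate tied to the latent space changes, and flipping the sign of `strength` reverses the edit, which loosely mirrors the idea of steering a model toward truthful or hallucinatory responses with a single vector.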

Paper Structure (66 sections, 14 equations, 11 figures, 9 tables)

Introduction
Related Work
TruthX
Extracting Internal Representations
Probing with Auto-Encoder
Editing in Truthful Space
Experiments
Datasets
Baselines
Main Results
Results on More LLMs
Analyses
Ablation Study
Superiority of Editing in Truthful Space
Effect of Editing Layers and Strength
...and 51 more sections

Figures (11)

Figure 1: An example showing that TruthX can control an LLM to generate truthful or hallucinatory, yet coherent, responses by editing one vector in the LLM's internal representations.
Figure 2: The schematic diagram of TruthX, which maps the LLM's internal representations into truthful and semantic latent spaces, and then probes and edits the LLM in the truthful space, thereby enhancing its truthfulness.
Figure 3: Improvements that TruthX brings to various LLMs on the TruthfulQA benchmark.
Figure 4: Perplexity of generated results on TruthfulQA, evaluated by Llama-2-7B-Chat.
Figure 5: Kernel density estimate of latent spaces.
...and 6 more figures
