TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

Shaolei Zhang, Tian Yu, Yang Feng

TL;DR

TruthX mitigates LLM hallucinations by editing the model's internal representations at inference time. An auto-encoder maps those representations into decoupled truthful and semantic latent spaces, and a contrastive objective identifies a truthful editing direction. That direction is then applied in the truthful space at the top-ranked editing layers to boost truthfulness without sacrificing generative ability. Results on TruthfulQA and additional benchmarks across 13 LLMs show substantial gains in truthfulness (about 20% on average) with competitive informativeness, and further analyses examine the method's robustness and layer-wise behavior. The work also reports generalization across models within homologous families and finds middle layers to be the most informative for truthfulness control.

Abstract

Large Language Models (LLMs) sometimes produce hallucinations; in particular, they may generate untruthful responses despite knowing the correct knowledge. Activating the truthfulness within an LLM is the key to fully unlocking its knowledge potential. In this paper, we propose TruthX, an inference-time intervention method that activates the truthfulness of an LLM by identifying and editing the features within its internal representations that govern truthfulness. TruthX employs an auto-encoder to map the LLM's representations into semantic and truthful latent spaces, respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, by editing the LLM's internal representations in the truthful space, TruthX effectively enhances its truthfulness. Experiments show that TruthX improves the truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark. Further analyses suggest that TruthX can control an LLM to produce truthful or hallucinatory responses by editing only one vector in its internal representations.
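
The editing idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear encoders and decoder, the random untrained weights, the dimensions, the mean-difference direction (used here in place of contrastive training), the symmetric shift in `edit`, and the strength parameter `alpha` are all assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 16, 4  # hypothetical sizes, not taken from the paper

# Hypothetical linear maps standing in for a learned auto-encoder.
W_sem = rng.normal(size=(d_latent, d_model))      # semantic encoder
W_truth = rng.normal(size=(d_latent, d_model))    # truthful encoder
W_dec = rng.normal(size=(d_model, 2 * d_latent))  # decoder over both latents


def encode(x):
    """Map a representation to (semantic latent, truthful latent)."""
    return W_sem @ x, W_truth @ x


def decode(h_sem, h_truth):
    """Map the two latents back to representation space."""
    return W_dec @ np.concatenate([h_sem, h_truth])


def editing_direction(truthful_reps, untruthful_reps):
    """Unit vector from the untruthful mean to the truthful mean in truthful space."""
    h_pos = truthful_reps @ W_truth.T
    h_neg = untruthful_reps @ W_truth.T
    delta = h_pos.mean(axis=0) - h_neg.mean(axis=0)
    return delta / np.linalg.norm(delta)


def edit(x, delta, alpha=1.0):
    """Shift x along delta in truthful space, leaving the semantic latent fixed."""
    h_sem, h_truth = encode(x)
    shift = decode(h_sem, h_truth + delta) - decode(h_sem, h_truth - delta)
    return x + alpha * shift


# Synthetic stand-ins for representations of truthful and untruthful answers.
truthful = rng.normal(size=(8, d_model)) + 0.5
untruthful = rng.normal(size=(8, d_model)) - 0.5
delta = editing_direction(truthful, untruthful)

x = rng.normal(size=d_model)
x_truthful = edit(x, delta, alpha=1.0)   # push toward the "truthful" side
x_halluc = edit(x, -delta, alpha=1.0)    # push toward the "hallucinatory" side
```

In this sketch, flipping the sign of `delta` reverses the edit, which mirrors the abstract's claim that a single vector can steer the model toward truthful or hallucinatory responses. Because the weights are random, the example shows only the mechanics of the edit, not any gain in truthfulness.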

Paper Structure

This paper contains 66 sections, 14 equations, 11 figures, and 9 tables.

Figures (11)

  • Figure 1: A case to show that TruthX can control LLM to generate truthful or hallucinatory coherent responses via editing one vector in LLM's internal representations.
  • Figure 2: The schematic diagram of TruthX, which maps the LLM's internal representations into truthful and semantic latent spaces, and then probes and edits the LLM in the truthful space, thereby enhancing its truthfulness.
  • Figure 3: Improvements of TruthX brought to various LLMs on TruthfulQA benchmark.
  • Figure 4: Perplexity of generating results on TruthfulQA, evaluated by Llama-2-7B-Chat.
  • Figure 5: Kernel density estimate of latent spaces.
  • ...and 6 more figures