Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression

Farima Fatahi Bayat, Xin Liu, H. V. Jagadish, Lu Wang

TL;DR

This work tackles factual inaccuracies in large language models by introducing LITO, a Learnable Intervention method that adaptively tunes the intensity of truth-directed interventions across contexts. Building on inference-time interventions (ITI) and RepE directions, LITO collects multiple generations at increasing intensities and uses an LSTM to predict which generation is most truthful, selecting it or abstaining when uncertainty is high. The approach yields consistent improvements in the Truthfulness-Accuracy (TA) score across multiple datasets and model families, while preserving task performance, and demonstrates transferability across intervention techniques and tasks. Overall, LITO offers a practical, adaptive mechanism to reduce hallucinations in open-domain QA with manageable inference overhead and broad applicability.

Abstract

Large language models (LLMs) can generate long-form and coherent text, yet they often hallucinate facts, which undermines their reliability. To mitigate this issue, inference-time methods steer LLM representations toward the "truthful directions" previously learned for truth elicitation. However, applying these truthful directions with the same intensity fails to generalize across different query contexts. We propose LITO, a Learnable Intervention method for Truthfulness Optimization that automatically identifies the optimal intervention intensity tailored to each specific context. LITO explores a sequence of model generations based on increasing levels of intervention intensities. It selects the most accurate response or refuses to answer when the predictions are highly uncertain. Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy. The adaptive nature of LITO counters the limitations of one-size-fits-all intervention methods, maximizing truthfulness by reflecting the model's internal knowledge only when it is confident. Our code is available at https://github.com/launchnlp/LITO.
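
The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Candidate` fields, the `ABSTAIN` string, and the `select_response` helper are hypothetical names introduced here, and the per-response accuracy prediction is assumed to come from a separately trained classifier (the TL;DR mentions an LSTM) that is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    alpha: float              # intervention intensity used for this generation
    text: str                 # generated response
    confidence: float         # model's confidence in the response, in [0, 1]
    predicted_accurate: bool  # classifier's accuracy prediction (assumed given)


# Hypothetical refusal string; the source only says the method "refuses to answer".
ABSTAIN = "I have no comment."


def select_response(candidates: List[Candidate]) -> str:
    """Return the most confident response predicted accurate, else abstain."""
    accurate = [c for c in candidates if c.predicted_accurate]
    if not accurate:
        # No generation is predicted accurate: express uncertainty.
        return ABSTAIN
    return max(accurate, key=lambda c: c.confidence).text


candidates = [
    Candidate(alpha=5, text="cellulose", confidence=0.62, predicted_accurate=False),
    Candidate(alpha=10, text="peptidoglycan", confidence=0.71, predicted_accurate=True),
    Candidate(alpha=15, text="peptidoglycan layers", confidence=0.55, predicted_accurate=True),
]
chosen = select_response(candidates)
```

The example values are made up for illustration. Filtering before ranking means a confident but likely wrong response never outranks a less confident response that is predicted accurate.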

Paper Structure

This paper contains 33 sections, 6 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Model responses using the inference-time intervention method with intensities increasing from 5 to 25. For different queries, the model achieves correct responses at varying intensity levels, indicated by green (correct) and red (incorrect) colors. Darkness of color represents the model's confidence in its response.
  • Figure 2: Overview of the LITO method. Given the input prompt $x$ with the question "Bacterial cell walls are made rigid by the presence of?", our method first collects model-generated responses after applying ITI-identified directions at 5 intensities $LLM_{\alpha=5k}(x)$ (Section \ref{sec:app1}). Each response contains the textual response, the model's confidence in the generated response (shown by darkness of color), and the aggregated hidden representations $h_i$, computed as the average across hidden states of response tokens. LITO predicts the accuracy of each response given its hidden representations and, among the responses predicted accurate (labeled as $1$), selects the one with the highest confidence; otherwise it indicates uncertainty.
  • Figure 3: Truthfulness and accuracy scores per dataset on five LMs. ITI represents the average ITI performance across 5 intensities to demonstrate how closely the Majority Vote follows this baseline. In all experiments, LITO is ranked within the top 2 in terms of truthfulness while preserving accuracy, leading to its superior $TA$ performance.
  • Figure 4: Transfer results of ITI-based LITO, measured by $TA$ score on 5 LMs. The y-axis corresponds to the training dataset, and the x-axis corresponds to the test dataset. Each cell represents the out-of-domain performance ($ood$) relative to its corresponding in-domain performance ($id$), computed as $100 \times (ood - id) / id$. Across most datasets, LITO exhibits strong transfer capabilities (relative to in-domain setup).
  • Figure 5: Performance of LITO on the validation sets of 4 datasets using different $k$ values. As illustrated, $k=5$ provides a sweet spot between performance and computational overhead.
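
The relative transfer score in the Figure 4 caption, $100 \times (ood - id) / id$, is simple to compute. The sketch below is only an illustration of that formula: the function name `relative_transfer` and the example scores are hypothetical and are not results from the paper.

```python
def relative_transfer(ood: float, id_score: float) -> float:
    """Out-of-domain score relative to in-domain score, in percent.

    Implements 100 * (ood - id) / id from the Figure 4 caption.
    Negative values mean the transferred model underperforms the in-domain one.
    """
    if id_score == 0:
        raise ValueError("in-domain score must be non-zero")
    return 100.0 * (ood - id_score) / id_score


# Made-up TA scores for illustration: 45.0 out-of-domain vs. 50.0 in-domain.
drop = relative_transfer(45.0, 50.0)
```

With these made-up scores the function returns -10.0, a 10% relative drop from the in-domain setup.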