Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression
Farima Fatahi Bayat, Xin Liu, H. V. Jagadish, Lu Wang
TL;DR
This work tackles factual inaccuracies in large language models by introducing LITO, a Learnable Intervention method that adapts the intensity of truth-directed interventions to each query context. Building on truthful directions from Inference-Time Intervention (ITI) and representation engineering (RepE), LITO collects multiple generations at increasing intervention intensities and uses an LSTM to predict which generation is most truthful, then either selects that generation or abstains when uncertainty is high. The approach consistently improves the Truthfulness-Accuracy (TA) score across multiple datasets and model families while preserving task performance, and it transfers across intervention techniques and tasks. Overall, LITO offers a practical, adaptive mechanism for reducing hallucinations in open-domain QA with manageable inference overhead.
Abstract
Large language models (LLMs) can generate long-form, coherent text, yet they often hallucinate facts, which undermines their reliability. To mitigate this issue, inference-time methods steer LLM representations toward "truthful directions" previously learned for truth elicitation. However, applying these truthful directions with the same intensity to every query fails to generalize across different contexts. We propose LITO, a Learnable Intervention method for Truthfulness Optimization that automatically identifies the optimal intervention intensity for each specific context. LITO explores a sequence of model generations produced at increasing intervention intensities. It then selects the most accurate response, or refuses to answer when the predictions are highly uncertain. Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy. The adaptive nature of LITO counters the limitations of one-size-fits-all intervention methods, maximizing truthfulness by reflecting the model's internal knowledge only when it is confident. Our code is available at https://github.com/launchnlp/LITO.
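The select-or-abstain procedure described above can be illustrated with a minimal Python sketch. It is not the released implementation. The names `shift_activation`, `select_or_abstain`, `ABSTAIN`, and the `threshold` parameter are hypothetical. The per-response truthfulness scores are assumed to come from a learned predictor (an LSTM in LITO) that is not reproduced here, and a simple maximum-score rule with a cutoff stands in for that predictor's decision.

```python
from typing import List, Sequence, Tuple

# Hypothetical refusal string; the wording is an assumption for illustration.
ABSTAIN = "I have no comment."


def shift_activation(
    hidden: Sequence[float], direction: Sequence[float], alpha: float
) -> List[float]:
    """Move a hidden vector along a truthful direction, scaled by intensity alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def select_or_abstain(
    candidates: Sequence[Tuple[str, float]], threshold: float = 0.5
) -> str:
    """Pick the response with the highest predicted truthfulness, or abstain.

    `candidates` holds one (response, score) pair per intervention intensity.
    The scores are assumed to be supplied by an external learned predictor.
    """
    if not candidates:
        return ABSTAIN
    best_response, best_score = max(candidates, key=lambda pair: pair[1])
    # Abstain when even the best candidate is not predicted truthful enough.
    return best_response if best_score >= threshold else ABSTAIN
```

In this sketch, a caller would generate one response per intensity value (for example, by applying `shift_activation` with increasing `alpha` inside the model's forward pass), score each response, and pass the resulting pairs to `select_or_abstain`.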
