TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong

TL;DR

TruthRL is a truthfulness-focused RL framework for LLMs that replaces accuracy-only optimization with a ternary reward distinguishing correct answers, abstentions, and hallucinations. Using online RL with GRPO and knowledge boundary probing, it substantially reduces hallucinations and increases truthfulness across four knowledge-intensive benchmarks, in both retrieval and non-retrieval settings. Ablation studies show that the simple ternary reward often outperforms binary or more complex rewards, and that the approach remains robust across model scales and evaluators. The work highlights the importance of reward design and knowledge-boundary awareness for producing truthful, calibrated LLMs, and suggests avenues for augmenting TruthRL with reasoning-aware signals.

Abstract

While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. An in-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.
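
To make the reward design concrete, the following is a minimal sketch of how a ternary reward could feed GRPO-style group-relative advantages. It is illustrative only, not the authors' implementation. The numeric values (+1, 0, -1), the binary baseline that scores every non-correct response as -1, the label strings, and the function name `group_advantages` are all assumptions; the abstract states only that correct answers, abstentions, and hallucinations are distinguished.

```python
from statistics import mean, pstdev

# Assumed reward values: correct answers rewarded, abstentions neutral,
# hallucinations penalized. The exact numbers are not given in the text.
TERNARY_REWARD = {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}

# Assumed accuracy-only baseline: anything that is not correct scores the same.
BINARY_REWARD = {"correct": 1.0, "abstain": -1.0, "hallucination": -1.0}


def group_advantages(labels, reward_table, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    `labels` holds one outcome label per response sampled for the same
    question. Each reward is centered on the group mean and scaled by the
    group (population) standard deviation, as in group-relative methods.
    """
    rewards = [reward_table[label] for label in labels]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One hypothetical group of four responses to the same question.
group = ["correct", "abstain", "hallucination", "hallucination"]
ternary = group_advantages(group, TERNARY_REWARD)
binary = group_advantages(group, BINARY_REWARD)
```

Under the ternary table the abstention receives a strictly higher advantage than either hallucination, whereas under the binary table the two are indistinguishable, so the policy gets no signal to prefer abstaining over guessing.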

Paper Structure

This paper contains 27 sections, 5 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Comparison between vanilla supervised fine-tuning (SFT), reinforcement learning (RL), and TruthRL. In vanilla SFT/RL, the model is optimized solely for accuracy, implicitly rewarding hallucinations over abstentions and thus always attempting to answer or guess, which ultimately compromises truthfulness. In contrast, TruthRL not only rewards correct answers, but explicitly penalizes hallucinations, and treats abstentions neutrally, thereby leading to greater truthfulness.
  • Figure 2: Scaling curve of prompting and vanilla SFT/RL methods on the CRAG benchmark (Yang et al., 2024), using Llama3.1-8B-Instruct as the backbone. Before training, the model shows strong potential in majority@$k$ scaling, with reduced hallucination and improved accuracy and abstentions as the number of responses increases. However, despite their slightly improved accuracy, vanilla SFT and RL diminish this potential and lead to much higher hallucinations, underscoring their limitations and the need for a more truthful training paradigm.
  • Figure 3: Performance decomposed into accuracy (blue), hallucination (red), and uncertainty (gray). Compared to baselines, TruthRL achieves the highest overall accuracy and the lowest hallucination. On difficult questions where almost no method can provide correct answers, TruthRL produces minimal hallucinations while other methods hallucinate heavily, demonstrating its improved capability in recognizing knowledge boundaries.
  • Figure 4: Learning dynamics of TruthRL under different reward designs. The dashed lines labeled as "Enhanced" represent the knowledge-enhanced reward schemes (defined in the TruthRL method section).
  • Figure 5: Study of model behaviors under different output confidence on CRAG.
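
The majority@$k$ scaling mentioned in the Figure 2 caption can be sketched as a simple vote over the first $k$ sampled responses. This is a hedged illustration, not the paper's evaluation code: the normalization (lowercasing and stripping whitespace), the tie-breaking rule, and the function name `majority_at_k` are all assumptions.

```python
from collections import Counter


def majority_at_k(answers, k):
    """Return the most frequent normalized answer among the first k samples.

    An abstention such as "I don't know" is treated as an ordinary candidate,
    so it can win the vote when the model is mostly uncertain.
    """
    if k < 1 or not answers:
        raise ValueError("need at least one answer and k >= 1")
    votes = Counter(a.strip().lower() for a in answers[:k])
    # most_common sorts by count; ties keep first-seen order.
    return votes.most_common(1)[0][0]


# Hypothetical samples for one question.
samples = ["Lyon", "I don't know", "I don't know", "Paris"]
first_only = majority_at_k(samples, 1)
voted = majority_at_k(samples, 4)
```

In this toy case a single sample yields a guess, while voting over four samples resolves to the abstention. This is the kind of shift toward fewer hallucinations with more responses that the caption describes.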