Enhancing Reliability across Short and Long-Form QA via Reinforcement Learning

Yudong Wang, Zhe Yang, Wenhan Ma, Zhifang Sui, Liang Zhao

TL;DR

The paper tackles the reliability-capability trade-off in reinforcement learning for large language models by distinguishing intrinsic and extrinsic hallucinations and proposing a targeted RL framework that uses three data formats (short-form QA, long-form QA with references, long-form QA without references) and tailored reward signals. It introduces novel training data from TriviaQA and FineWeb to address extrinsic and intrinsic hallucinations, respectively, and demonstrates that explicit refusal instructions and summarized chain-of-thought supervision strike a balance between reasoning and factuality. The results show substantial improvements in hallucination reduction and reliability across two base models and diverse benchmarks, while also revealing trade-offs between verbosity and accuracy and highlighting limitations such as evaluator dependence and data diversity. Overall, the work provides practical guidance for aligning RL-based LLMs toward trustworthy, cautious, and capable QA systems across formats.

Abstract

While reinforcement learning has unlocked unprecedented complex reasoning in large language models, it has also amplified their propensity for hallucination, creating a critical trade-off between capability and reliability. This work confronts this challenge by introducing a targeted RL framework designed to mitigate both intrinsic and extrinsic hallucinations across short and long-form question answering. We address extrinsic hallucinations (flawed internal knowledge) by creating a novel training set from open-ended conversions of TriviaQA. Concurrently, we tackle intrinsic hallucinations (unfaithfulness to context) by leveraging long-form texts from FineWeb in a fact-grounding reward scheme. To further bolster reliability, our framework explicitly rewards the model for refusing to answer unanswerable questions, thereby cultivating crucial cautiousness. Extensive experiments demonstrate that our methodology yields significant performance gains across a diverse suite of benchmarks, substantially reducing both hallucination types. Ultimately, this research contributes a practical framework for resolving the critical tension between advanced reasoning and factual trustworthiness, paving the way for more capable and reliable large language models.
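To make the reward idea in the abstract concrete, the following is a minimal sketch of what a refusal-aware reward could look like. It is an illustration under stated assumptions, not the authors' implementation: the function names, the refusal phrases, the substring-based correctness check, and the reward values (+1, 0, -1) are all hypothetical, since the abstract does not specify them. The long-form function assumes that each claim has already been judged as supported or not, by the reference text or by search results.

```python
# Hypothetical sketch: names, refusal markers, and reward values are assumptions,
# not taken from the paper.

REFUSAL_MARKERS = ("i don't know", "i do not know", "i'm not sure", "cannot answer")


def is_refusal(response: str) -> bool:
    """Crude refusal detector based on a fixed list of phrases (an assumption)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def short_form_reward(response: str, gold_answers: list[str], answerable: bool = True) -> float:
    """Reward correct answers, penalize wrong ones, and treat refusal as neutral.

    For unanswerable questions, a refusal is the rewarded behavior.
    gold_answers should contain non-empty strings.
    """
    if is_refusal(response):
        return 0.0 if answerable else 1.0
    if not answerable:
        return -1.0  # answering an unanswerable question counts as a hallucination
    normalized = response.lower()
    correct = any(gold.lower() in normalized for gold in gold_answers)
    return 1.0 if correct else -1.0


def long_form_reward(claim_supported: list[bool]) -> float:
    """Fraction of extracted claims that were judged as supported."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


if __name__ == "__main__":
    print(short_form_reward("The capital is Paris.", ["Paris"]))
    print(short_form_reward("I don't know.", ["Paris"]))
    print(long_form_reward([True, True, False, True]))
```

With these assumed values, a wrong answer scores below a refusal, so a policy trained on such a signal has an incentive to abstain when it is unsure rather than guess.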

Paper Structure

This paper contains 33 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The three task formats for evaluating and mitigating hallucinations. In Short-form QA, answers are directly verified. In Long-form QA with reference, claims are checked against the provided text to assess for intrinsic hallucinations. In Long-form QA without reference, claims are checked against search results to assess for extrinsic hallucinations.
  • Figure 2: Composition of the training data constructed from the FineWeb dataset. The left chart illustrates the distribution of subject domains for the filtered source contexts, while the right chart shows the distribution of the types of questions generated based on those contexts.
  • Figure 3: Performance Comparison of CoT Supervision Strategies Across Three Benchmarks.
  • Figure 4: Training trajectory on TriviaQA. For MiMo-7B-RL-0530, the hallucination rate drops quickly and saturates early in training, after which accuracy begins to climb steadily.
  • Figure 5: Comparison of Penalty Functions for Balancing Verbosity and Accuracy. The training dynamics illustrate that directly penalizing for a low number of claims can increase model verbosity late in training, but this explicitly compromises accuracy. The LLM and win-rate penalties achieve a more stable, albeit more concise, performance.
  • ...and 1 more figure