Learning to Reason for Hallucination Span Detection

Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Kundan Krishna, Hadi Pouransari, Cheng-Yu Hsieh, Cem Koc, Joseph Yitan Cheng, Oncel Tuzel, Raviteja Vemulapalli

TL;DR

This paper addresses the challenge of reliably detecting hallucination spans in LLM outputs by formulating hallucination span detection as a span-level reasoning task trained with reinforcement learning. It introduces RL4HS, which uses a span-F1–based reward within Group Relative Policy Optimization (GRPO) and couples it with Class-Aware Policy Optimization (CAPO) to balance class contributions. Empirical results on the RAGTruth benchmark show that RL4HS substantially outperforms pretrained reasoning models and supervised fine-tuning baselines, with CAPO mitigating reward hacking and improving precision–recall balance. The work demonstrates that in-domain, span-level reasoning learned through RL is crucial for robust hallucination span detection and establishes a practical framework for integrating explicit reasoning into fine-grained detection tasks.

Abstract

Large language models (LLMs) often generate hallucinations -- unsupported content that undermines reliability. While most prior work frames hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision-making process. This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.
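To make the span-level reward concrete, the following is a minimal sketch, not the authors' implementation. It assumes spans are half-open character offsets, that span F1 is computed from character overlap, and that an empty prediction against an empty reference earns a reward of 1. The group-relative advantage follows the standard GRPO normalization (reward minus group mean, divided by group standard deviation). The `class_aware_scale` helper and its `alpha` parameter are hypothetical: they illustrate one way a class-aware scheme could damp the advantages of "no hallucination" predictions, and the abstract does not specify CAPO's exact form.

```python
from statistics import mean, pstdev

Span = tuple[int, int]  # half-open character offsets [start, end); an assumption


def covered(spans: list[Span]) -> set[int]:
    """Return the set of character positions covered by the spans."""
    chars: set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_f1(pred: list[Span], gold: list[Span]) -> float:
    """Character-overlap F1 between predicted and reference spans."""
    p, g = covered(pred), covered(gold)
    if not p and not g:
        return 1.0  # assumption: correctly predicting "no hallucination" scores 1
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def class_aware_scale(
    advantages: list[float], predicts_no_hallucination: list[bool], alpha: float = 0.5
) -> list[float]:
    """Hypothetical class-aware step: shrink advantages of empty predictions."""
    return [
        a * alpha if empty else a
        for a, empty in zip(advantages, predicts_no_hallucination)
    ]
```

For example, a prediction covering characters 0 to 10 against a reference covering 5 to 15 overlaps on 5 of 10 characters on each side, so precision and recall are both 0.5 and the reward is 0.5.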

Paper Structure

This paper contains 36 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Span-F1@K for different numbers of predictions $K$. CoT reasoning provides a significant boost as $K$ increases, clearly demonstrating its potential.
  • Figure 2: Expected advantage assigned to predictions of the pretrained Qwen2.5-7B-Instruct model, by prediction type. Values are shown separately for the three task-based splits of the RAGTruth dataset.
  • Figure 3: Advantage distribution by model prediction, shown across tasks for the pretrained Qwen2.5-7B-Instruct model. Non-hallucination predictions (red) receive higher advantages than hallucination predictions (blue), revealing a class imbalance issue.
  • Figure 4: Training dynamics of GRPO (red) and CAPO (blue) on the Qwen2.5-7B-Instruct model. While GRPO exhibits high precision but declining recall due to reward hacking, CAPO stabilizes recall without sacrificing precision, yielding consistently higher span F1. Shaded regions denote standard deviations across runs.
  • Figure 5: Out-of-domain evaluation on RAGTruth. Span-F1 scores on the RAGTruth dataset. Our RL4HS-OOD-7B model performs competitively with larger reasoning models, showing the benefit of span-level reward fine-tuning. All Qwen2.5 models are the Instruct variants.
  • ...and 2 more figures
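
The Span-F1@K metric in Figure 1 can be read as a best-of-K score. The sketch below assumes that reading, which is inferred from the abstract's remark that CoT can "generate at least one correct answer when sampled multiple times" and is not confirmed by this page: for each example, take the highest span F1 among its first K sampled predictions, then average over examples. The function name and input layout are illustrative.

```python
def span_f1_at_k(per_example_scores: list[list[float]], k: int) -> float:
    """Average, over examples, of the best span F1 among the first k samples.

    per_example_scores[i][j] is the span F1 of the j-th sampled prediction
    for example i (an assumed input layout).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not per_example_scores:
        raise ValueError("need at least one example")
    best = [max(scores[:k]) for scores in per_example_scores]
    return sum(best) / len(best)
```

Under this reading the metric cannot decrease as K grows, so a steep rise with K indicates that correct answers exist among the samples even when the first sample misses them.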