Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward

Hao An, Yang Xu

TL;DR

This work tackles hallucinations in large language models by promoting post-hoc abstention guided by fine-grained semantic confidence. It introduces FiSCoRe, a reinforcement learning framework that semantically clusters multiple sampled answers to produce per-sample confidence rewards, so that the model answers only when consensus is strong. A new reliability metric, $\text{F1}_{rel}$, harmonically combines $\text{F1}_{ans}$ and $\text{F1}_{abs}$ to balance helpfulness and truthfulness, and is contrasted with the limitations of the existing RS. Empirical results on in-domain and out-of-domain QA tasks show that FiSCoRe yields more robust reliability and generalizes better across datasets, at the cost of increased computation due to structured outputs and abstention signaling.
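As a rough illustration of what "harmonically combines" could mean, the sketch below assumes the plain harmonic mean of the two component scores. This is an assumption: the paper's exact definition of $\text{F1}_{rel}$, and of $\text{F1}_{ans}$ and $\text{F1}_{abs}$ themselves, is not reproduced in this summary, and the function name `f1_rel` is hypothetical.

```python
def f1_rel(f1_ans: float, f1_abs: float) -> float:
    """Harmonic mean of two scores in [0, 1] (assumed form, not the paper's code).

    Returns 0.0 when both inputs are zero to avoid dividing by zero.
    """
    total = f1_ans + f1_abs
    if total == 0:
        return 0.0
    return 2 * f1_ans * f1_abs / total


# A model that answers well but rarely abstains correctly is penalized heavily,
# while a balanced model keeps its score.
balanced = f1_rel(0.5, 0.5)
lopsided = f1_rel(0.9, 0.1)
```

Under this assumed form, a high score requires both components to be high, which matches the stated goal of balancing helpfulness against truthfulness.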

Abstract

Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on Fine-grained Semantic Confidence Reward (FiSCoRe), which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a new metric for evaluating the reliability of abstention fine-tuning tasks more comprehensively. Our method significantly enhances reliability in both in-domain and out-of-distribution benchmarks.
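The sample, cluster, and reward steps described in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: semantic clustering is replaced by a crude stand-in (case and whitespace folding), a cluster's confidence is taken to be its share of the samples, and the threshold `tau`, the reward values of +1 and -1, and all function names are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude stand-in for semantic equivalence: fold case and whitespace."""
    return " ".join(answer.lower().split())


def cluster_confidences(samples: list[str]) -> list[float]:
    """For each sample, return the fraction of samples in its cluster."""
    keys = [normalize(s) for s in samples]
    sizes = Counter(keys)
    return [sizes[k] / len(samples) for k in keys]


def confidence_rewards(
    samples: list[str], abstained: list[bool], tau: float = 0.5
) -> list[float]:
    """Reward keeping high-confidence answers and discarding low-confidence ones.

    abstained[i] is True when the model discarded sample i after producing it.
    tau is a hypothetical confidence threshold, not a value from the paper.
    """
    rewards = []
    for conf, abst in zip(cluster_confidences(samples), abstained):
        should_abstain = conf < tau
        rewards.append(1.0 if abst == should_abstain else -1.0)
    return rewards


samples = ["Paris", "paris", " Paris ", "Lyon"]
abstained = [False, False, True, True]
rewards = confidence_rewards(samples, abstained)
```

In this toy run the three "Paris" variants form one cluster with confidence 0.75 and "Lyon" forms a singleton with confidence 0.25, so the third sample is penalized for discarding a high-confidence answer while the fourth is rewarded for discarding a low-confidence one. The point is that each sample receives its own reward, in contrast to a single overall score per question.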

Paper Structure

This paper contains 33 sections, 23 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: (a) Accuracy-based abstention. (b) Entropy-based abstention. (c) Ours: fine-grained confidence-based abstention.
  • Figure 2: The training overview of our method. (a) The GRPO pipeline. (b) A detailed example of how FiSCoRe works.
  • Figure 3: (a) $\text{F1}_{rel}$ and accuracy for different numbers of sampled answers $G$ on TriviaQA with Qwen2.5-7B-Instruct. (b) $\text{F1}_{ans}$, $\text{F1}_{abs}$, and $\text{F1}_{rel}$ for different accuracy reward weights $w_a$ on TriviaQA with Qwen2.5-7B-Instruct.
  • Figure 4: Percentage of each prediction type across methods. The compared methods are SE-Tuning, GRPO-SE, and FiSCoRe.