I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation

Haotian Zong, Binze Li, Yufei Long, Sinyin Chang, Jialong Wu, Gillian K. Hadfield

Abstract

Large language models (LLMs) frequently produce confident but incorrect answers, partly because common binary scoring conventions reward answering over honestly expressing uncertainty. We study whether prompt-only interventions -- explicitly announcing reward schemes for answer-versus-abstain decisions plus humility-oriented normative principles -- can reduce hallucination risk without modifying the model. Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being uncertain about their answers. We first assess self-reported verbal confidence as a usable uncertainty signal, showing stability under prompt paraphrasing and reasonable calibration against a token-probability baseline. We then study I-CALM, a prompt-based framework that (i) elicits verbal confidence, (ii) partially rewards abstention through explicit reward schemes, and (iii) adds lightweight normative principles emphasizing truthfulness, humility, and responsibility. Using GPT-5 mini on PopQA as the main setting, we find that confidence-eliciting, abstention-rewarding prompts, especially with norms, reduce the false-answer rate on answered cases mainly by identifying and shifting error-prone cases to abstention and re-calibrating their confidence. This trades coverage for reliability while leaving forced-answer performance largely unchanged. Varying the abstention reward yields a clear abstention-hallucination frontier. Overall, results show the framework can improve selective answering on factual questions without retraining, with the magnitude of effect varying across models and datasets. Code is available at https://github.com/binzeli/hallucinationControl.
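
The announced reward schemes make the intended behavior computable. As a minimal sketch, assuming a responder that maximizes the announced expected reward under Scheme B $(+1,-1,+0.4)$ -- $+1$ for a correct answer, $-1$ for a wrong one, $+0.4$ for abstaining -- the break-even confidence works out to $0.7$. The function names below are illustrative, not from the released code.

```python
# Expected-reward reading of an announce-and-abstain scheme; defaults encode
# Scheme B (+1 correct, -1 wrong, +0.4 abstain). Names are illustrative.

def expected_answer_reward(p_correct: float,
                           r_correct: float = 1.0,
                           r_wrong: float = -1.0) -> float:
    """Expected reward from answering with self-assessed P(correct) = p_correct."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong

def should_abstain(p_correct: float, r_abstain: float = 0.4) -> bool:
    """Abstain when the abstention reward beats answering in expectation."""
    return r_abstain > expected_answer_reward(p_correct)

# Break-even: p* = (r_abstain - r_wrong) / (r_correct - r_wrong)
#               = (0.4 + 1.0) / 2.0 = 0.7  ->  answer iff confidence >= 0.7.
assert should_abstain(0.65) and not should_abstain(0.75)
```

Sweeping the abstention reward moves this break-even confidence, which is one way to read the abstention-hallucination frontier the abstract describes.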

Paper Structure

This paper contains 81 sections, 3 theorems, 50 equations, 19 figures, and 22 tables.

Key Result

Proposition 1

Let $(U,E)$ denote the reported uncertainty and error indicator, respectively, for a fresh example drawn from the same distribution, where $E=1$ iff the final answer is incorrect. Assume the calibration examples and future examples are i.i.d. draws from a common distribution over $(U,E)$. Let $R(u) = \Pr(E=1 \mid U \le u)$ denote the conditional false-answer rate when answering exactly those examples with reported uncertainty at most $u$, with the convention $R(u)=0$ when $\Pr(U\le u)=0$. Define the one-sided Clopper--Pearson upper confidence bound …
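
The Clopper--Pearson bound named here has a closed form via beta quantiles: with $k$ errors observed among $n$ answered calibration examples, the one-sided $(1-\alpha)$ upper bound is the $(1-\alpha)$ quantile of $\mathrm{Beta}(k+1,\, n-k)$. The sketch below assumes this standard form; since the full statement is truncated above, the function name `cp_ucb` and its interface are ours.

```python
# Minimal sketch of a one-sided Clopper-Pearson upper confidence bound on a
# binomial rate, as used for finite-sample FAR control in Proposition 1.
# Interface is illustrative; only the beta-quantile formula is standard.
from scipy.stats import beta

def cp_ucb(k: int, n: int, alpha: float = 0.05) -> float:
    """(1 - alpha) upper bound on the error rate given k errors in n trials."""
    if n == 0 or k >= n:
        return 1.0  # no data, or all errors: the bound is vacuously 1
    return float(beta.ppf(1.0 - alpha, k + 1, n - k))

# Example: 3 errors among 80 answered calibration cases at alpha = 0.05.
print(cp_ucb(3, 80))
```

In the spirit of the proposition, one would certify $R(u) \le \varepsilon$ at a candidate threshold $u$ by checking that this bound falls below $\varepsilon$.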

Figures (19)

  • Figure 1: Overview of the two-stage prompting protocol and downstream analysis.
  • Figure 2: Normative principles oriented toward truthfulness, humility, and responsibility.
  • Figure 3: Cross-model PopQA comparison under representative setups. Panels show $\mathrm{FAR}_{\mathrm{answered}}$, AER, $\widehat{\text{ECE}}$, and Brier score for first-round answers under Pure Eval, Scheme A $(+1,-1)$, Scheme B $(+1,-1,+0.4)$, and Scheme B with norms $(+1,-1,+0.4)$; these metrics are sketched in code after this list.
  • Figure 4: AER for GPT-5 mini on PopQA across reward configurations. Values in parentheses on the x-axis give the abstention reward.
  • Figure 5: Distribution of PopQA questions answered incorrectly under Pure Eval, as reclassified under the representative scheme setups (GPT-5 mini).
  • ...and 14 more figures
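
The selective-answering metrics named in Figure 3 are straightforward to compute from per-example records. Below is a minimal sketch under our own reading of the acronyms -- AER taken as the abstention rate and $\mathrm{FAR}_{\mathrm{answered}}$ as the false-answer rate among answered cases -- with a standard 10-bin ECE estimate; the `metrics` function and its record format are illustrative assumptions, not the paper's released evaluation code.

```python
# Hypothetical metric helpers; the array layout and acronym readings (AER as
# abstention rate) are our assumptions, not the paper's released code.
import numpy as np

def metrics(answered, correct, confidence, n_bins: int = 10):
    """Selective-answering metrics from per-example boolean/float arrays."""
    answered = np.asarray(answered, dtype=bool)
    correct = np.asarray(correct, dtype=bool)
    conf = np.asarray(confidence, dtype=float)[answered]  # answered cases only
    acc = correct[answered].astype(float)

    aer = float(1.0 - answered.mean())                    # fraction abstained
    far = float((1.0 - acc).mean()) if acc.size else 0.0  # FAR on answered set

    # Equal-width binned expected calibration error and Brier score.
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = sum((bins == b).mean() * abs(acc[bins == b].mean() - conf[bins == b].mean())
              for b in range(n_bins) if (bins == b).any())
    brier = float(np.mean((conf - acc) ** 2))
    return {"AER": aer, "FAR_answered": far, "ECE": float(ece), "Brier": brier}
```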

Theorems & Definitions (6)

  • Proposition 1: Finite-sample FAR control via CP-UCB
  • Proof sketch of Proposition 1
  • Proposition 2: Finite-sample FAR control via multistart fixed-sequence CP
  • Proof sketch of Proposition 2
  • Proposition 3
  • Proof of Proposition 3