Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

Samir Abdaljalil, Erchin Serpedin, Hasan Kurban

TL;DR

The paper addresses the risk that models forced to answer scientific reasoning tasks propagate incorrect conclusions. It proposes an abstention-aware verification pipeline that decomposes inputs into minimal conditions, audits each condition with a fixed NLI verifier, and aggregates the outcomes under an asymmetric loss to decide whether to answer or abstain. On SciFact and PubMedQA, reducing coverage via confidence-based abstention consistently lowers risk, and reliability depends more on the abstention strategy than on model size or architecture. The result is a practical, model-agnostic framework and evaluation paradigm (risk–coverage) for trustworthy scientific reasoning, with broader implications for selective reasoning in high-stakes domains.

Abstract

Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder models, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention-aware evaluation as a practical and model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at https://github.com/sabdaljalil2000/ai4science.
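
The decompose, audit, and decide steps described in the abstract can be illustrated with a minimal Python sketch. It assumes the decomposition and NLI verification have already produced one judgment per condition. The names `ConditionJudgment` and `decide`, the label strings, and the aggregation rule itself are hypothetical choices made for illustration; they are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label names for condition-level NLI outcomes.
SUPPORT, CONTRADICT, MISSING = "support", "contradict", "missing"


@dataclass
class ConditionJudgment:
    condition: str  # text of one minimal condition
    label: str      # one of SUPPORT, CONTRADICT, MISSING
    prob: float     # verifier probability for that label, in [0, 1]


def decide(judgments: List[ConditionJudgment], tau: float) -> str:
    """Aggregate condition-level judgments into SUPPORT, REFUTE, or ABSTAIN.

    Assumed aggregation rule (illustrative only):
    - any contradicted condition makes the claim a candidate refutation,
      with confidence equal to the strongest contradiction probability;
    - if every condition is supported, the claim is a candidate support,
      with confidence equal to the weakest support probability;
    - otherwise some condition lacks evidence, so the decision is ABSTAIN;
    - a candidate label is emitted only if its confidence reaches tau.
    """
    if not judgments:
        return "ABSTAIN"
    contradicted = [j for j in judgments if j.label == CONTRADICT]
    if contradicted:
        label, conf = "REFUTE", max(j.prob for j in contradicted)
    elif all(j.label == SUPPORT for j in judgments):
        label, conf = "SUPPORT", min(j.prob for j in judgments)
    else:
        return "ABSTAIN"
    return label if conf >= tau else "ABSTAIN"
```

Raising `tau` in this sketch makes the system abstain more often, which is the coverage-for-risk trade-off the abstract describes.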

Paper Structure (31 sections, 3 theorems, 32 equations, 2 figures, 2 tables)

This paper contains 31 sections, 3 theorems, 32 equations, 2 figures, 2 tables.

Key Result

Proposition 1

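If $\mathrm{conf}$ is rank-calibrated with respect to $F$, $\ell$, and $\mathcal{D}$, then for any $\tau_1 < \tau_2$ with $\phi(\tau_2) > 0$, the selective risk at threshold $\tau_2$ is no greater than the selective risk at threshold $\tau_1$.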

Figures (2)

  • Figure 1: Overview of the abstention-aware scientific reasoning pipeline. An input claim or question is decomposed into a set of conditions. Each condition is independently evaluated against available evidence using the NLI verifier, producing condition-level judgments of support, contradiction, or missing evidence. These judgments are aggregated by a decision module that computes a confidence score and compares it against an abstention threshold $\tau$, emitting a final label only when the available evidence is sufficient.
  • Figure 2: Risk–coverage curves for (top) SciFact and (bottom) PubMedQA. Lower curves indicate better reliability under selective prediction.
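
A risk–coverage curve of the kind shown in Figure 2 can be computed from per-example confidence scores and correctness flags. The sketch below assumes 0-1 loss and defines risk as the error rate among answered examples; the function name `risk_coverage_curve` is hypothetical, and the paper's own evaluation code may differ.

```python
from typing import List, Sequence, Tuple


def risk_coverage_curve(
    confidences: Sequence[float], correct: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (coverage, selective risk) points under 0-1 loss.

    Examples are sorted by descending confidence. The k-th point answers
    the k most confident examples and abstains on the rest. Tied
    confidences are treated as separate points here, which is a
    simplification of strict thresholding.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    points: List[Tuple[float, float]] = []
    errors = 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((k / n, errors / k))  # (coverage, risk among answered)
    return points
```

For example, with confidences `[0.9, 0.8, 0.6, 0.4]` and correctness `[True, True, False, True]`, risk is zero at 50% coverage and rises once the third, incorrect example is answered.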

Theorems & Definitions (9)

  • Definition 1: Selective Classifier
  • Definition 2: Selective Risk and Coverage
  • Definition 3: Rank-Calibration
  • Proposition 1: Monotonicity of Selective Risk
  • proof
  • Proposition 2: Finite-Sample Concentration
  • proof
  • Proposition 3: Bayes-Optimal Decision Threshold
  • proof
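
The statement of Proposition 3 is not reproduced in this listing. As a point of reference only, the sketch below implements the textbook reject-option rule (Chow's rule) under an asymmetric loss: a wrong answer costs `cost_error`, abstaining costs `cost_abstain`, and a correct answer costs nothing. The function names and this cost structure are assumptions and may not match the paper's formulation.

```python
def abstention_threshold(cost_error: float, cost_abstain: float) -> float:
    """Posterior threshold above which answering beats abstaining.

    Answering with top-class posterior p has expected loss
    (1 - p) * cost_error, while abstaining has loss cost_abstain.
    Answering is preferred when p >= 1 - cost_abstain / cost_error.
    """
    if cost_error <= 0:
        raise ValueError("cost_error must be positive")
    if not 0 <= cost_abstain <= cost_error:
        raise ValueError("cost_abstain must lie in [0, cost_error]")
    return 1.0 - cost_abstain / cost_error


def should_answer(top_posterior: float, cost_error: float, cost_abstain: float) -> bool:
    """Return True when answering has expected loss no higher than abstaining."""
    return top_posterior >= abstention_threshold(cost_error, cost_abstain)
```

With `cost_error = 4` and `cost_abstain = 1`, the threshold is 0.75: the more expensive an error is relative to abstaining, the more confident the system must be before it answers.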