When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation

Shani Goren, Ido Galil, Ran El-Yaniv

TL;DR

The paper tackles unreliable long-form generation by introducing Selective Abstraction (SA), which trades specificity for reliability by replacing uncertain content with higher-confidence abstractions. SA uses a four-stage atom-wise pipeline and formalizes the trade-off with selective risk and coverage, employing RC curves and the AURC metric to quantify performance. An end-to-end evaluation across six open-source LLMs on FactScore and LongFact-Objects shows up to a 27.73% improvement in AURC, demonstrating that reducing detail in uncertain parts can boost factual accuracy while preserving meaning. A conformal-thresholding method is proposed to select risk-targeted abstraction levels with probabilistic guarantees. Overall, SA provides a principled approach to safer, more reliable long-form generation by controlling information density at the claim level.
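
To make the AURC comparison concrete, here is a minimal sketch of how the area under a risk-coverage curve could be computed from a handful of (coverage, risk) points. The trapezoidal rule, the function names `aurc` and `relative_improvement`, and all numbers are illustrative assumptions; the paper's exact AURC definition and normalization are not given in this summary, and reading the reported improvement as a relative reduction in AURC is likewise an assumption.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (coverage, risk)


def aurc(points: List[Point]) -> float:
    """Area under a risk-coverage curve via the trapezoidal rule.

    Lower is better: less risk accumulated across coverage levels.
    """
    pts = sorted(points)  # sort by coverage
    area = 0.0
    for (c0, r0), (c1, r1) in zip(pts, pts[1:]):
        area += (c1 - c0) * (r0 + r1) / 2.0
    return area


def relative_improvement(baseline: float, method: float) -> float:
    """Percentage reduction in AURC relative to a baseline (assumed reading)."""
    return 100.0 * (baseline - method) / baseline


# Made-up curves, NOT results from the paper.
baseline_curve = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.3)]
method_curve = [(0.0, 0.0), (0.5, 0.1), (1.0, 0.3)]

if __name__ == "__main__":
    b, m = aurc(baseline_curve), aurc(method_curve)
    print(f"baseline AURC={b:.3f}, method AURC={m:.3f}, "
          f"improvement={relative_improvement(b, m):.2f}%")
```

A curve that sits lower at every coverage level yields a smaller area, which is why a lower AURC is treated here as better.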

Abstract

LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigate this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary "all-or-nothing" approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements each expressing a single fact) and replaces uncertain atoms with higher-confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and measures coverage using an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal, demonstrating that reducing specificity can boost accuracy and reliability while preserving most of the responses' original meaning.
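
The atom-wise idea described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's pipeline: the `Version` class, the `select` and `risk_and_coverage` helpers, the confidence values, and the per-version `info` scores (a stand-in for the information-theoretic coverage measure, whose definition is not given here) are all hypothetical, and the example claims are invented.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Version:
    text: str
    confidence: float  # estimated probability that the claim is correct
    info: float        # hypothetical information content of the claim
    correct: bool      # ground-truth label, used only for evaluation


def select(versions: List[Version], threshold: float) -> Optional[Version]:
    """Return the most specific version meeting the threshold, else None.

    `versions` is ordered from most specific to most abstract.
    """
    for v in versions:
        if v.confidence >= threshold:
            return v
    return None


def risk_and_coverage(atoms: List[List[Version]], threshold: float) -> Tuple[float, float]:
    """Risk = share of emitted claims that are wrong; coverage = share of
    the original (most specific) information that is retained."""
    kept = [v for v in (select(a, threshold) for a in atoms) if v is not None]
    total_info = sum(a[0].info for a in atoms)
    coverage = sum(v.info for v in kept) / total_info if total_info else 0.0
    risk = sum(not v.correct for v in kept) / len(kept) if kept else 0.0
    return risk, coverage


# Invented atoms: the second one has a less specific fallback.
atoms = [
    [Version("Ada moved to Lyon in 1990.", 0.95, 10.0, True)],
    [Version("Ada studied at school X.", 0.40, 8.0, False),
     Version("Ada studied at a school in Lyon.", 0.90, 4.0, True)],
]

if __name__ == "__main__":
    for t in (0.0, 0.85, 0.99):
        print(t, risk_and_coverage(atoms, t))
```

Raising the threshold swaps the uncertain atom for its abstraction, which lowers risk at the cost of some retained information; sweeping the threshold traces out one risk-coverage curve.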

Paper Structure

This paper contains 28 sections, 1 theorem, 15 equations, 9 figures, 4 tables, and 1 algorithm.

Key Result

Theorem H.1

Suppose the claims in the calibration set $A_{cal}=\{a_i\}_{i=1}^{n}$ and a given test claim $a_{n+1}$ are exchangeable. For any target risk $\alpha\in(0,1)$ and $\delta$, define $\theta_{n+1}$, $\hat{\theta}$ and $\epsilon$ as in Algorithm \ref{alg:sa_threshold}, and $R(\hat{\theta})=P(\theta_{n+1}>\hat{

Figures (9)

  • Figure 1: Left: Example abstraction sequence for atom-wise selective abstraction (SA). Increasing the confidence threshold replaces low-confidence atoms with less specific, more reliable abstractions. Right: Example risk-coverage curve comparing atom-wise SA to baselines. Model: gpt-oss-120b, dataset: FactScore.
  • Figure 2: An overview of the Selective Abstraction framework. The generated text is decomposed into atoms, and low-confidence atoms are replaced with confident abstractions. The confidence threshold in this example is 85%.
  • Figure 3: Qualitative example of atom-wise Selective Abstraction at two confidence thresholds: a higher threshold reduces detail but improves reliability, while a lower threshold retains detail at higher risk. Red text signifies incorrect claims. Generating model: gpt-oss-120b, dataset: FactScore. See Appendix \ref{app:sa_examples} for additional examples.
  • Figure 4: Risk-guided threshold selection on FactScore (left) and LongFact (right), averaged across models (600 runs per dataset). Blue points show mean risk at the selected thresholds, bands show one standard deviation, and red X marks the target risk.
  • Figure 5: Risk–coverage curves obtained by instantiating $\kappa$ with different logprob-based confidence scores on FactScore using gpt-oss-120b. Legend is sorted by AURC in ascending order.
  • ...and 4 more figures
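
As a rough illustration of risk-guided threshold selection (the setting of Figure 4), the sketch below picks the lowest confidence threshold whose empirical risk on a calibration set stays within a target. It is a simplified stand-in, not the paper's algorithm: the function names, the optional `margin` parameter, and the calibration data are hypothetical, and this plain empirical rule carries none of the probabilistic guarantees stated in Theorem H.1.

```python
from typing import List, Optional, Tuple

Claim = Tuple[float, bool]  # (confidence, is_correct)


def empirical_risk(cal: List[Claim], threshold: float) -> float:
    """Error rate among calibration claims with confidence >= threshold."""
    kept = [correct for conf, correct in cal if conf >= threshold]
    if not kept:
        return 0.0
    return sum(not c for c in kept) / len(kept)


def pick_threshold(cal: List[Claim], alpha: float, margin: float = 0.0) -> Optional[float]:
    """Lowest observed confidence whose calibration risk is <= alpha - margin.

    `margin` is a hypothetical safety slack; returns None if no candidate
    threshold meets the target.
    """
    for t in sorted({conf for conf, _ in cal}):
        if empirical_risk(cal, t) <= alpha - margin:
            return t
    return None


# Made-up calibration claims, NOT data from the paper.
cal = [(0.2, False), (0.4, False), (0.6, True), (0.7, False),
       (0.8, True), (0.9, True), (0.95, True)]

if __name__ == "__main__":
    for alpha in (0.2, 0.1):
        print(alpha, pick_threshold(cal, alpha))
```

A stricter target risk pushes the selected threshold upward, which in the selective-abstraction setting means more atoms are replaced by abstractions.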

Theorems & Definitions (2)

  • Theorem H.1
  • proof