Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know"

Dhruv Madhwal; Lyuxin David Zhang; Dan Roth; Tomer Wolfson; Vivek Gupta

TL;DR

The paper investigates reliability in closed-book QA by contrasting Direct, Assistive, and Incremental prompting across six multi-hop benchmarks and a range of model scales. It finds that decomposition improves non-frontier models but yields diminishing returns for frontier LLMs, while cross-regime disagreement becomes a strong, training-free signal of potential errors. Building on this, the authors introduce Disagreement-Based Abstention (DBA), a simple gate that abstains when direct and decomposed outputs disagree, outperforming standard uncertainty baselines in error detection. DBA remains effective across models and datasets and can be complemented by ensembling with other methods for even stronger reliability signals. Collectively, the work reframes decomposition as a diagnostic tool for model reliability in closed-book QA and highlights the practical value of cross-regime consistency for identifying fragile beliefs in large language models.

Abstract

Large language models often struggle to recognize their knowledge limits in closed-book question answering, leading to confident hallucinations. While decomposed prompting is typically used to improve accuracy, we investigate its impact on reliability. We evaluate three task-equivalent prompting regimes: Direct, Assistive, and Incremental, across different model scales and multi-hop QA benchmarks. We find that although accuracy gains from decomposition diminish in frontier models, disagreements between prompting regimes remain highly indicative of potential errors. Because factual knowledge is stable while hallucinations are stochastic, cross-regime agreement provides a precise signal of internal uncertainty. We leverage this signal to implement a training-free abstention policy that requires no retrieval or fine-tuning. Our results show that disagreement-based abstention outperforms standard uncertainty baselines as an error detector, improving both F1 and AUROC across settings. This demonstrates that decomposition-based prompting can serve as a practical diagnostic probe for model reliability in closed-book QA.
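The abstention policy described in the abstract can be illustrated with a minimal sketch. It assumes each question already has a Direct answer and one decomposed (Assistive or Incremental) answer as strings. The names `normalize`, `answers_agree`, `dba_gate`, and `error_detection_f1` are hypothetical, and normalized string matching is only a stand-in for whatever comparison of semantic claims the authors use; this is not the paper's implementation.

```python
import re
import string

IDK = "I don't know"


def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = answer.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answers_agree(a: str, b: str) -> bool:
    """Assumed agreement check: normalized exact match (a stand-in for a
    semantic comparison of the two answers)."""
    return normalize(a) == normalize(b)


def dba_gate(direct: str, decomposed: str) -> str:
    """Return the Direct answer when both regimes agree, otherwise abstain."""
    return direct if answers_agree(direct, decomposed) else IDK


def error_detection_f1(records):
    """Score disagreement as a detector of wrong Direct answers.

    records: iterable of (direct, decomposed, gold) string triples.
    A positive prediction is "the regimes disagree"; a positive label is
    "the Direct answer does not match gold".
    """
    tp = fp = fn = 0
    for direct, decomposed, gold in records:
        flagged = not answers_agree(direct, decomposed)
        wrong = not answers_agree(direct, gold)
        if flagged and wrong:
            tp += 1
        elif flagged and not wrong:
            fp += 1
        elif wrong:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy records, invented purely for illustration.
toy_records = [
    ("Paris", "paris.", "Paris"),                # agree, correct
    ("Lyon", "Paris", "Paris"),                  # disagree, Direct wrong
    ("Rome", "Rome", "Milan"),                   # agree, Direct wrong
    ("The Hague", "Amsterdam", "The Hague"),     # disagree, Direct correct
]
```

On real outputs, the string match would likely need to be replaced by a more tolerant equivalence check (aliases, paraphrases), since surface mismatches would otherwise be counted as disagreements.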

Paper Structure (60 sections, 4 figures, 18 tables)

Introduction
Decomposition-based Prompting
Why these prompting regimes?
Verified Reference Decompositions
Experimental
Models.
Datasets.
Prompting Regimes.
Consistency Protocol.
Evaluation Measures.
Results and Analysis
Scale, Accuracy and Consistency.
Decomposition Gains Plateau in Frontier LLMs.
Consistency as an Answer Reliability Signal.
Accuracy-Consistency Correlation.
...and 45 more sections

Figures (4)

Figure 1: DSL decomposition for a multi-hop question.
Figure 2: Accuracy vs. consistency rate across 9 models and 6 datasets, grouped by difficulty. Each point represents a (model, dataset) pair. Marker shape encodes LLM size (Frontier, 70B, 32B, 8B) while colors encode different evaluation datasets.
Figure 3: Disagreement-Based Abstention (DBA) Framework. Our method compares a Direct answer against Assistive/Incremental reasoning paths; if the semantic claims disagree, the model abstains (IDK).
Figure 4: Incremental accuracy vs. Incremental consistency rate (Direct-Incremental agreement) across 9 models and 6 datasets. Similar to the Assistive case (Figure 2), we observe strong positive correlations, confirming that the relationship between consistency and correctness generalizes across reasoning interfaces.
