QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

Belinda Z. Li, Been Kim, Zi Wang

TL;DR

QuestBench formalizes underspecification in reasoning as 1-sufficient CSPs and builds four domains (Logic-Q, Planning-Q, GSM-Q, GSME-Q) to evaluate LLMs on acquiring the necessary information via a single clarifying question. It systematically analyzes model performance, correlations with problem difficulty, and ablations on well-specified variants to separate reasoning from information acquisition. The findings show strong performance on math-based GSM tasks but notable gaps in logic and planning domains, indicating that effective information-gathering is a distinct capability beyond mere reasoning with sufficient information. The work provides a formal framework, a rigorous, ground-truth benchmark, and directions for extending to more complex, multi-question CSPs and realistic user simulations, with implications for safer, more interactive AI systems.

Abstract

Large language models (LLMs) have shown impressive performance on reasoning benchmarks like math and logic. While many works have largely assumed well-defined tasks, real-world queries are often underspecified and only solvable by acquiring missing information. We formalize this information-gathering problem as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case where only one necessary variable assignment is missing, we can evaluate an LLM's ability to identify the minimal necessary question to ask. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with partially-observed initial states, (3) GSM-Q: human-annotated grade school math problems with one unknown variable, and (4) GSME-Q: equation-based version of GSM-Q. The LLM must select the correct clarification question from multiple options. While current models excel at GSM-Q and GSME-Q, they achieve only 40-50% accuracy on Logic-Q and Planning-Q. Analysis shows that the ability to solve well-specified reasoning problems is not sufficient for success on our benchmark: models struggle to identify the right question even when they can solve the fully specified version. This highlights the need for specifically optimizing models' information acquisition capabilities.
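To make the CSP framing concrete, here is a minimal sketch of a toy underspecified problem in which exactly one missing variable assignment stands between the known facts and the target. It is an illustration under stated assumptions, not the benchmark's construction procedure: the dependency-style constraints, the function names `derivable` and `sufficient_questions`, and the fruit variables are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

# Hypothetical toy encoding (not from the paper): each entry says the key
# variable can be computed once every variable in one of its dependency
# sets is known, e.g. total = apples + oranges.
Constraints = Dict[str, List[FrozenSet[str]]]


def derivable(known: Set[str], constraints: Constraints) -> Set[str]:
    """Return every variable whose value follows from the known ones."""
    closure = set(known)
    changed = True
    while changed:
        changed = False
        for var, alternatives in constraints.items():
            if var not in closure and any(deps <= closure for deps in alternatives):
                closure.add(var)
                changed = True
    return closure


def sufficient_questions(
    known: Set[str],
    constraints: Constraints,
    target: str,
    candidates: Iterable[str],
) -> List[str]:
    """List candidate variables whose value alone would make the target derivable."""
    if target in derivable(known, constraints):
        return []  # already well-specified, no question needed
    return sorted(
        v
        for v in candidates
        if v not in known and target in derivable(known | {v}, constraints)
    )


constraints: Constraints = {
    "total": [frozenset({"apples", "oranges"})],
    "cost": [frozenset({"total", "price"})],
    "juice": [frozenset({"pears"})],  # irrelevant to the target
}
known = {"apples", "price"}

# Asking about "oranges" unlocks "total" and then "cost"; "pears" does not help.
questions = sufficient_questions(known, constraints, "cost", ["oranges", "pears"])
print(questions)
```

In this toy setup the candidate list plays the role of the multiple-choice options, and the brute-force check over candidates stands in for whatever search a model would have to perform implicitly.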

Paper Structure

This paper contains 63 sections, 20 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: A multi-choice question-asking task in QuestBench with ground truth answers for accuracy evaluation. We construct question choices using the CSP translated from the verbal problem.
  • Figure 2: An example in Logic-Q. The prompt provided to the LM is on the left-hand side. The ground-truth answer is in red. The symbolic CSP used to construct the questions is shown on the right-hand side.
  • Figure 3: An example in Planning-Q. The ground-truth answer is given in red. The prompt given to the LM (left) includes the full task specification in PDDL, which we omit for simplicity and instead display visually. Possible initial states are constructed from the partial initial state and are grouped based on plans to the goal. These groups of initial states are used for constructing the questions.
  • Figure 4: LM accuracies across varying backwards search depths $d$, number of variables $|X|$, number of constraints $|C|$, and expected number of brute-force guesses $\mathbb{E}_\text{BF}$ for each domain, model, and prompt setting. To make the graph less noisy, we aggregate units of 5 on the $x$-axis for the Logic-Q setting for $|X|$ and $|C|$.
  • Figure 5: Screenshot of the annotation interface used for obtaining CSPs for each math problem in the GSM setting.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Example 3.1
  • Definition 3.1
  • Definition 3.2
  • Definition 3.3