QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
Belinda Z. Li, Been Kim, Zi Wang
TL;DR
QuestBench formalizes underspecification in reasoning as 1-sufficient constraint satisfaction problems (CSPs) and builds four domains (Logic-Q, Planning-Q, GSM-Q, GSME-Q) to evaluate whether LLMs can acquire the necessary information with a single clarifying question. It analyzes model performance and its correlation with problem difficulty, and runs ablations on well-specified variants to separate reasoning from information acquisition. Models perform strongly on the math-based GSM-Q and GSME-Q tasks but show notable gaps on the logic and planning domains, indicating that effective information gathering is a distinct capability beyond reasoning with sufficient information. The work provides a formal framework, a benchmark with ground-truth answers, and directions for extending to more complex multi-question CSPs and realistic user simulations, with implications for safer, more interactive AI systems.
Abstract
Large language models (LLMs) have shown impressive performance on reasoning benchmarks in domains such as math and logic. While many works assume well-defined tasks, real-world queries are often underspecified and solvable only by acquiring missing information. We formalize this information-gathering problem as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case where only one necessary variable assignment is missing, we can evaluate an LLM's ability to identify the minimal necessary question to ask. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with partially observed initial states, (3) GSM-Q: human-annotated grade school math problems with one unknown variable, and (4) GSME-Q: an equation-based version of GSM-Q. The LLM must select the correct clarification question from multiple options. While current models excel at GSM-Q and GSME-Q, they achieve only 40-50% accuracy on Logic-Q and Planning-Q. Analysis shows that the ability to solve well-specified reasoning problems is not sufficient for success on our benchmark: models struggle to identify the right question even when they can solve the fully specified version. This highlights the need to specifically optimize models' information acquisition capabilities.
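The special case described in the abstract, where a single missing variable assignment stands between the known facts and the target, can be illustrated with a toy equation-style problem in the spirit of GSME-Q. The following is a minimal sketch, not the benchmark's actual construction: the variable names, the dependency-dictionary representation, and the helper functions `derivable` and `sufficient_questions` are all hypothetical and chosen only for illustration.

```python
# Toy illustration of a one-question underspecified problem.
# Assumption: each derived variable has exactly one equation, represented
# only by the set of input variables it depends on (no actual arithmetic).

def derivable(known, equations, target):
    """Return True if `target` can be computed by repeatedly applying
    equations whose inputs are all already known."""
    known = set(known)
    changed = True
    while changed:
        changed = False
        for out, inputs in equations.items():
            if out not in known and inputs <= known:
                known.add(out)
                changed = True
    return target in known


def sufficient_questions(known, equations, target, candidates):
    """Return the candidate variables whose value, if asked for and
    answered, would make `target` computable."""
    if derivable(known, equations, target):
        return []  # already well-specified; no question is needed
    return [
        v for v in candidates
        if derivable(set(known) | {v}, equations, target)
    ]


# Hypothetical word problem: "How much change does the customer get?"
equations = {
    "total": {"price", "quantity"},   # total = price * quantity
    "change": {"paid", "total"},      # change = paid - total
}
known = {"price", "paid"}             # stated in the problem text
candidates = ["quantity", "discount", "tax_rate"]  # possible questions

result = sufficient_questions(known, equations, "change", candidates)
print(result)
```

Here only asking for `quantity` makes `change` computable; the other candidates are distractors that do not appear in any equation. Choosing among such options is a simplified analogue of the multiple-choice question selection described above.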
