Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

Nicholas Edwards, Sebastian Schuster

Abstract

As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution. Our results demonstrate that this multi-agent system using OpenHands + Claude Sonnet 4.5 achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup (61.20%) and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated uncertainty, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators that independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks.
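The core idea of the scaffold, decoupling underspecification detection from code execution, can be sketched as a simple control loop. The sketch below is purely illustrative: the class and function names (`AgentState`, `intent_agent`, `main_agent_step`) are hypothetical, and the keyword heuristic stands in for the LLM-based Intent Agent described in the paper.

```python
# Hypothetical sketch of the decoupled scaffold: an "Intent Agent" inspects
# the interaction history for underspecification before the "Main Agent" is
# allowed to act. In the real system both roles are LLM calls; here the
# detector is a toy keyword heuristic for illustration only.

from dataclasses import dataclass, field

@dataclass
class AgentState:
    history: list = field(default_factory=list)

def intent_agent(state: AgentState) -> bool:
    """Return True if the latest instruction looks underspecified.
    (A real system would prompt an LLM with the state history.)"""
    latest = state.history[-1].lower()
    vague_markers = ("somehow", "etc", "fix it", "make it better")
    return any(marker in latest for marker in vague_markers)

def main_agent_step(state: AgentState) -> str:
    """One turn: ask a clarifying question if the Intent Agent flags
    underspecification, otherwise proceed with execution."""
    if intent_agent(state):
        question = "Could you clarify the expected behavior?"
        state.history.append(question)
        return f"ASK: {question}"
    return "EXECUTE: proceeding with the task"

state = AgentState(history=["fix it somehow"])
print(main_agent_step(state))   # underspecified instruction -> agent asks
state.history.append("Raise ValueError on negative input")
print(main_agent_step(state))   # clarified instruction -> agent executes
```

The key design choice mirrored here is that the detector runs on the state history at every turn, so execution is halted and the user queried only when the detector fires, rather than the main agent self-monitoring while it codes.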

Paper Structure

This paper contains 35 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of the uncertainty-aware multi-agent scaffold. The Intent Agent analyzes the state history at each turn to detect underspecification, halting execution to constrain the Main Agent to query the user if missing information is required.
  • Figure 2: Task resolve rates (in %) across evaluation settings. Explicitly separating underspecification detection and code execution allows UA-Multi (69.40%) to significantly outperform UA-Single (61.20%, $p < 0.001$), closing the gap with the explicitly prompted Interactive Baseline. All reported $p$-values are computed via non-parametric permutation tests.
  • Figure 3: The base SWE-bench task prompt provided to the agents. The highlighted block in bold indicates the explicit clarification instructions added in the interactive variant, provided only to the Interactive Baseline.
  • Figure 4: Prompts required for the custom uncertainty-aware agent scaffolds. Part A shows the reminder prompt for the Uncertainty-Aware (Single) agent at each turn. Part B shows the system prompt for the specialized Intent Agent in the Uncertainty-Aware (Multi) agent scaffold.
  • Figure 5: The prompt provided to the user simulator, including specific guardrails to prevent unintended test modifications (rule 5) and resolve environment directory mismatches (rule 6).