Interactive Agents to Overcome Ambiguity in Software Engineering
Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig
TL;DR
Ambiguity in user instructions poses a major challenge for AI-driven software engineering tasks. The authors evaluate open and proprietary LLMs within an agentic, interactive framework across three tasks—interactive problem solving, ambiguity detection, and question quality—using SWE-Bench Verified as the evaluation corpus. They find that interactivity substantially improves performance on underspecified inputs (e.g., Claude Sonnet 3.5 and Haiku 3.5 reach up to about 80% of well-specified performance), but many models struggle to reliably detect ambiguity or generate highly informative clarifying questions, highlighting gaps in current approaches. The work emphasizes the need for dedicated training and improved interaction strategies to reduce misalignment, safety risks, and wasted resources in real-world, task-oriented code generation.
Abstract
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes, safety risks due to tool misuse, and wasted computational resources. In this work, we study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance across three key steps: (a) leveraging interactivity to improve performance in ambiguous scenarios, (b) detecting ambiguity, and (c) asking targeted questions. Our findings reveal that models struggle to distinguish between well-specified and underspecified instructions. However, when models interact for underspecified inputs, they effectively obtain vital information from the user, leading to significant improvements in performance and underscoring the value of effective interaction. Our study highlights critical gaps in how current state-of-the-art models handle ambiguity in complex software engineering tasks and structures the evaluation into distinct steps to enable targeted improvements.
