Interactive Agents to Overcome Ambiguity in Software Engineering

Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig

TL;DR

Ambiguity in user instructions poses a major challenge for AI-driven software engineering tasks. The authors evaluate open and proprietary LLMs within an agentic, interactive framework across three tasks—interactive problem solving, ambiguity detection, and question quality—using SWE-Bench Verified as the evaluation corpus. They find that interactivity substantially improves performance on underspecified inputs (e.g., Claude Sonnet 3.5 and Haiku 3.5 reach up to about 80% of well-specified performance), but many models struggle to reliably detect ambiguity or generate highly informative clarifying questions, highlighting gaps in current approaches. The work emphasizes the need for dedicated training and improved interaction strategies to reduce misalignment, safety risks, and wasted resources in real-world, task-oriented code generation.

Abstract

AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes, safety risks due to tool misuse, and wasted computational resources. In this work, we study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance across three key steps: (a) leveraging interactivity to improve performance in ambiguous scenarios, (b) detecting ambiguity, and (c) asking targeted questions. Our findings reveal that models struggle to distinguish between well-specified and underspecified instructions. However, when models interact for underspecified inputs, they effectively obtain vital information from the user, leading to significant improvements in performance and underscoring the value of effective interaction. Our study highlights critical gaps in how current state-of-the-art models handle ambiguity in complex software engineering tasks and structures the evaluation into distinct steps to enable targeted improvements.

Paper Structure

This paper contains 35 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Interactive agents mitigate resource wastage and reduce misalignment in ambiguous settings.
  • Figure 2: The three settings in order: Full, Hidden, and Interaction.
  • Figure 3: Resolve rates (in %) across different settings: Hidden (underspecified issues), Interaction (underspecified issues with user interaction), and Full (fully specified issues).
  • Figure 4: Agent questions and user responses to the same underspecified input are shown for Llama 3.1 70B, Deepseek-v2, and Claude Haiku 3.5. The examples highlight specific interaction patterns and differences in handling ambiguity. The corresponding model inputs are detailed in Table \ref{tab:issue-analysis}.
  • Figure 5: Information Gain measured using (a) Cosine Distance Scores and (b) LLM-as-Judge Scores.
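
Figure 5(a) reports information gain as a cosine distance score. This listing does not say which embedding model is used or which pieces of text are compared, so the sketch below shows only the distance computation itself. It assumes that the task knowledge before and after the interaction has already been embedded as numeric vectors. The function name and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumption: real ones would come from an
# embedding model that this listing does not name).
before_interaction = [1.0, 0.0]
after_interaction = [0.0, 1.0]

# Orthogonal vectors are maximally dissimilar among non-negative embeddings.
distance = cosine_distance(before_interaction, after_interaction)
```

Under this reading, a larger distance would mean the interaction changed the available information more. Whether the paper scores it in exactly this direction is an assumption.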