
LHAW: Controllable Underspecification for Long-Horizon Tasks

George Pu, Michael S. Lee, Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, Samuel Marc Denton

TL;DR

LHAW addresses the gap in evaluating long-horizon autonomous agents under underspecification by introducing a dataset-agnostic pipeline that synthetically creates controllable information gaps across four dimensions (Goal, Constraint, Input, Context) and validates them through empirical agent trials. The framework operates in three phases—Segment Extraction, Candidate Generation, and Empirical Validation—to produce 285 underspecified variants across TheAgentCompany, SWE-Bench Pro, and MCP-Atlas, labeling each variant as outcome-critical, divergent, or benign. It also incorporates a simulated user via an ask_user tool to study clarification behavior, introduces the Gain/Q metric for efficiency of information gain per question, and analyzes how model choice and prompting strategies affect clarification effectiveness. The work demonstrates that clarification can recover substantial performance, but efficiency is model-dependent, and clarifying questions can both help and hinder depending on timing, targeting, and strategy. Overall, LHAW provides a principled, cost-aware benchmark for probing when agents should seek clarification and how to balance interruption costs against information value in long-horizon autonomous systems.

Abstract

Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions (Goals, Constraints, Inputs, and Context) at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro, and MCP-Atlas, according to our taxonomy, alongside a formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.
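
The taxonomy described in the abstract can be illustrated with a minimal Python sketch. The four dimensions and three ambiguity classes come from the abstract; everything else is an assumption made for illustration: the `Variant` record, its field names, and in particular the decision rule in `classify` (all trials fail, trials disagree, all trials pass). The paper's Definition 3.1 may state the rule differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Tuple


class Dimension(Enum):
    """The four information dimensions named in the abstract."""
    GOAL = "goal"
    CONSTRAINT = "constraint"
    INPUT = "input"
    CONTEXT = "context"


class AmbiguityClass(Enum):
    """The three empirical labels named in the abstract."""
    OUTCOME_CRITICAL = "outcome-critical"
    DIVERGENT = "divergent"
    BENIGN = "benign"


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one underspecified variant (field names assumed)."""
    task_id: str
    removed_dimensions: Tuple[Dimension, ...]
    prompt: str


def classify(trial_passed: Sequence[bool]) -> AmbiguityClass:
    """Label a variant from repeated agent trials.

    ASSUMED rule, not taken from the paper:
      - every trial fails  -> outcome-critical
      - trials disagree    -> divergent
      - every trial passes -> benign
    """
    if not trial_passed:
        raise ValueError("need at least one trial outcome")
    passes = sum(1 for passed in trial_passed if passed)
    if passes == 0:
        return AmbiguityClass.OUTCOME_CRITICAL
    if passes == len(trial_passed):
        return AmbiguityClass.BENIGN
    return AmbiguityClass.DIVERGENT
```

Under this assumed rule, a variant whose trials all reach a failing terminal state would be labeled outcome-critical, while one that agents complete correctly despite the missing information would be labeled benign.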

Paper Structure (80 sections, 6 figures, 17 tables)

Figures (6)

  • Figure 1: The LHAW Synthetic Underspecification Pipeline. Three phases: (1) Segment Extraction & Scoring: identify removable segments, classify by dimension, estimate criticality/guessability; (2) Variant Generation: apply different removal strategies to create underspecified prompts with expected questions; (3) Empirical Validation: run agent trials to classify variants as outcome-critical, divergent, or benign.
  • Figure 2: Segment Extraction & Scoring. The pipeline identifies removable segments, classifying each by dimension (color) and scoring for criticality and guessability.
  • Figure 3: LHAW Benchmark Distribution. Final dataset of 285 variants across three benchmarks. Left: variant and unique task counts per dataset. Center: ambiguity class distribution (outcome-critical, divergent, benign). Right: information dimension distribution (goal, constraint, input, context); multi-segment removals in MCP-Atlas count each dimension separately.
  • Figure 4: Value of information across models and tasks. Overall performance with the ask_user tool (pass@3 and Avg. Ckpt%) is plotted against the gain provided by each user question. The top right holds the most capable agents, which learn the most per user call; the bottom left holds the least capable agents, which learn the least per user call.
  • Figure 5: Full Taxonomy of ask_user Failure Modes. Using the judge prompt in the ask_user judge section (sec:ask-user-judge), we measure the frequency of trials flagged with each failure mode across the full taxonomy. Compound questions and missed critical segments dominate across ambiguity classes.
  • ...and 1 more figures
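
The "gain provided by each user question" in Figure 4 relates to the Gain/Q metric named in the TL;DR. This listing does not give its formula, so the sketch below is only one plausible reading: the score improvement from enabling ask_user, divided by the number of questions asked. The function name, parameter names, and the zero-question convention are all assumptions.

```python
def gain_per_question(score_with_ask: float,
                      score_without_ask: float,
                      num_questions: int) -> float:
    """Assumed Gain/Q: score recovered per clarifying question.

    score_with_ask / score_without_ask: the same performance measure
    (e.g. a pass rate in [0, 1]) with and without the ask_user tool.
    num_questions: number of ask_user calls made by the agent.
    """
    if num_questions < 0:
        raise ValueError("num_questions must be non-negative")
    if num_questions == 0:
        # Convention assumed here: no questions asked means no gain per question.
        return 0.0
    return (score_with_ask - score_without_ask) / num_questions
```

A negative value under this reading would correspond to the TL;DR's observation that clarifying questions can hinder as well as help.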

Theorems & Definitions (3)

  • Definition 2.1: Long-Horizon Workflow
  • Definition 2.2: Outcome-Critical Underspecification
  • Definition 3.1: Ambiguity Classification