LHAW: Controllable Underspecification for Long-Horizon Tasks
George Pu, Michael S. Lee, Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, Samuel Marc Denton
TL;DR
LHAW addresses the gap in evaluating long-horizon autonomous agents under underspecification by introducing a dataset-agnostic pipeline that synthetically creates controllable information gaps across four dimensions (Goal, Constraint, Input, Context) and validates them through empirical agent trials. The framework operates in three phases (Segment Extraction, Candidate Generation, and Empirical Validation) to produce 285 underspecified variants across TheAgentCompany, SWE-Bench Pro, and MCP-Atlas, labeling each variant as outcome-critical, divergent, or benign. It also incorporates a simulated user, exposed through an ask_user tool, to study clarification behavior; introduces the Gain/Q metric for the efficiency of information gain per question; and analyzes how model choice and prompting strategy affect clarification effectiveness. The work demonstrates that clarification can recover substantial performance, but that its efficiency is model-dependent: clarifying questions can help or hinder depending on their timing, targeting, and strategy. Overall, LHAW provides a principled, cost-aware benchmark for probing when agents should seek clarification and how to balance interruption costs against information value in long-horizon autonomous systems.
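The TL;DR names the Gain/Q metric without defining it. As a rough illustration only, the sketch below assumes Gain/Q is the task score recovered through clarification divided by the number of questions asked. The function name, its parameters, and the zero-question convention are hypothetical and are not taken from the paper.

```python
def gain_per_question(baseline_score: float,
                      clarified_score: float,
                      num_questions: int) -> float:
    """Hypothetical Gain/Q: score recovered per clarifying question.

    Assumption: "gain" is the difference between the score obtained with
    access to the ask_user tool and the score obtained on the same
    underspecified variant without it. The paper's definition may differ.
    """
    if num_questions <= 0:
        # Assumed convention: no questions asked means no gain to attribute.
        return 0.0
    return (clarified_score - baseline_score) / num_questions
```

Under this assumed definition, an agent that recovers the same score with fewer questions gets a higher Gain/Q, and a negative value means the questions coincided with a lower score.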
Abstract
Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which seeking clarification is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions (Goals, Constraints, Inputs, and Context) at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro, and MCP-Atlas, labeled according to our taxonomy, alongside a formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling the development of reliable autonomous systems.
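The abstract describes variants as defined by a removed-information dimension and a severity level, then labeled from empirical trials. The sketch below shows one way such a record and labeling step could look. The class names, the integer severity scale, and the labeling rule (every trial succeeds: benign; every trial fails: outcome-critical; mixed outcomes: divergent) are assumptions made for illustration, not the paper's actual criteria.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Dimension(Enum):
    """The four dimensions named in the abstract."""
    GOAL = "goal"
    CONSTRAINT = "constraint"
    INPUT = "input"
    CONTEXT = "context"


class Label(Enum):
    """The three variant labels named in the abstract."""
    OUTCOME_CRITICAL = "outcome-critical"
    DIVERGENT = "divergent"
    BENIGN = "benign"


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one underspecified task variant."""
    task_id: str
    dimension: Dimension
    severity: int  # assumed integer scale; the paper's levels may differ
    prompt: str    # task text with the selected information removed


def label_variant(trial_successes: Sequence[bool]) -> Label:
    """Label a variant from repeated agent trials (assumed rule)."""
    if not trial_successes:
        raise ValueError("at least one trial is required")
    if all(trial_successes):
        return Label.BENIGN            # the gap never changed the outcome
    if not any(trial_successes):
        return Label.OUTCOME_CRITICAL  # the gap always broke the task
    return Label.DIVERGENT             # outcomes differ across trials
```

The point of the sketch is that the label depends only on observed trial outcomes, not on a model's prediction of whether the prompt is ambiguous.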
