TAPE: Tool-Guided Adaptive Planning and Constrained Execution in Language Model Agents
Jongwon Jeong, Jungtaek Kim, Kangwook Lee
TL;DR
This work tackles irrecoverable failures of language-model agents operating under strict feasibility constraints by formalizing the problem as a goal-conditioned MDP with budgets and identifying planning and sampling errors as core bottlenecks. It introduces TAPE, a framework that builds a plan graph from multiple candidate plans, uses an external solver to select a feasible path, and enforces constrained decoding to execute the plan precisely, with replanning upon mismatches. Theoretical analysis shows TAPE offers a higher per-step and overall success probability than ReAct or Plan-and-Act by reducing planning errors and suppressing sampling errors, and empirical results across Sokoban, ALFWorld, MuSiQue, and GSM8K-Hard demonstrate substantial gains, especially for weaker models and harder tasks. Limitations include reliance on plan-graph fidelity and solver choice, guiding future work toward more robust graph construction and adaptive solver selection to broaden applicability.
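To see why per-step reliability matters so much when failures are irrecoverable, consider a deliberately simplified model (an assumption for illustration, not the paper's analysis): each step succeeds independently with the same probability, and one failed step fails the whole task. The function name and the numbers below are hypothetical.

```python
def overall_success(per_step_success: float, horizon: int) -> float:
    """Probability of finishing a task of `horizon` steps when every step
    must succeed and steps are assumed independent (illustrative only)."""
    return per_step_success ** horizon


# Hypothetical per-step success rates over a 20-step task.
for p in (0.95, 0.99):
    print(f"per-step {p:.2f} -> overall {overall_success(p, 20):.3f}")
```

Under this toy model, raising per-step success from 0.95 to 0.99 lifts overall success on a 20-step task from roughly 0.36 to roughly 0.82, which is the kind of compounding effect that motivates reducing both planning and sampling errors.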
Abstract
Language Model (LM) agents have demonstrated remarkable capabilities in solving tasks that require multiple interactions with the environment. However, they remain vulnerable in environments where a single error often leads to irrecoverable failure, particularly under strict feasibility constraints. We systematically analyze existing agent frameworks, identifying imperfect planning and stochastic execution as the primary causes. To address these challenges, we propose Tool-guided Adaptive Planning with constrained Execution (TAPE). TAPE enhances planning capability by aggregating multiple plans into a graph and employing an external solver to identify a feasible path. During execution, TAPE employs constrained decoding to reduce sampling noise, while adaptively re-planning whenever environmental feedback deviates from the intended state. Experiments across Sokoban, ALFWorld, MuSiQue, and GSM8K-Hard demonstrate that TAPE consistently outperforms existing frameworks, with particularly large gains in hard settings: on average, success rates improve by 21.0 percentage points on hard settings and by 20.0 percentage points for weaker base models. Code and data are available here.
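The loop described above (merge candidate plans into a graph, let a solver pick a feasible path, execute only the planned action, re-plan on a mismatch) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the candidate plans are hard-coded instead of sampled from an LM, Dijkstra's algorithm stands in for the external solver, restricting each step to the single planned action stands in for constrained decoding, and re-planning reuses the existing graph. All names (`build_plan_graph`, `solve`, `run`, `PLANS`, `env_step`) are hypothetical.

```python
import heapq
from collections import defaultdict

# Hypothetical candidate plans: lists of (state, action, next_state, cost).
PLANS = [
    [("s0", "push_left", "s1", 1), ("s1", "push_up", "dead", 1)],
    [("s0", "walk_right", "s2", 1), ("s2", "push_up", "goal", 2)],
    [("s0", "push_left", "s1", 1), ("s1", "walk_right", "s2", 1),
     ("s2", "push_up", "goal", 2)],
]


def build_plan_graph(plans):
    """Merge plans into one graph: state -> {(action, next_state): cost}."""
    graph = defaultdict(dict)
    for plan in plans:
        for state, action, next_state, cost in plan:
            graph[state][(action, next_state)] = cost
    return graph


def solve(graph, start, goal, budget):
    """Stand-in solver: cheapest path to goal within budget, else None."""
    frontier = [(0, start, [])]
    best = {start: 0}
    while frontier:
        spent, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if spent > best.get(state, float("inf")):
            continue  # stale queue entry
        for (action, nxt), cost in graph.get(state, {}).items():
            total = spent + cost
            if total <= budget and total < best.get(nxt, float("inf")):
                best[nxt] = total
                heapq.heappush(frontier, (total, nxt, path + [(action, nxt)]))
    return None


def run(plans, start, goal, budget, env_step, max_replans=3):
    """Execute the solver's path; re-plan when the observed state differs."""
    graph = build_plan_graph(plans)
    state, spent, trace = start, 0, []
    for _ in range(max_replans + 1):
        path = solve(graph, state, goal, budget - spent)
        if path is None:
            return False, trace
        for action, expected in path:
            # Constrained-execution stand-in: only the planned action is taken.
            observed = env_step(state, action)
            spent += graph[state][(action, expected)]
            trace.append(action)
            state = observed
            if observed != expected:
                break  # mismatch: re-plan from the observed state
        else:
            return True, trace
    return False, trace


def env_step(state, action):
    """Toy deterministic environment that follows the planned transitions."""
    moves = {("s0", "walk_right"): "s2", ("s0", "push_left"): "s1",
             ("s1", "walk_right"): "s2", ("s1", "push_up"): "dead",
             ("s2", "push_up"): "goal"}
    return moves[(state, action)]


print(run(PLANS, "s0", "goal", budget=3, env_step=env_step))
```

In this toy instance the first plan walks into a dead end, but the merged graph still contains a route to the goal, so the solver avoids the dead end without any single plan having to be perfect. A real system would presumably regenerate candidate plans when re-planning rather than reuse a fixed graph.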
