TAPE: Tool-Guided Adaptive Planning and Constrained Execution in Language Model Agents

Jongwon Jeong, Jungtaek Kim, Kangwook Lee

TL;DR

This work tackles irrecoverable failures of language-model agents operating under strict feasibility constraints by formalizing the problem as a goal-conditioned MDP with budgets and identifying planning and sampling errors as core bottlenecks. It introduces TAPE, a framework that builds a plan graph from multiple candidate plans, uses an external solver to select a feasible path, and enforces constrained decoding to execute the plan precisely, with replanning upon mismatches. Theoretical analysis shows TAPE offers a higher per-step and overall success probability than ReAct or Plan-and-Act by reducing planning errors and suppressing sampling errors, and empirical results across Sokoban, ALFWorld, MuSiQue, and GSM8K-Hard demonstrate substantial gains, especially for weaker models and harder tasks. Limitations include reliance on plan-graph fidelity and solver choice, guiding future work toward more robust graph construction and adaptive solver selection to broaden applicability.

Abstract

Language Model (LM) agents have demonstrated remarkable capabilities in solving tasks that require multiple interactions with the environment. However, they remain vulnerable in environments where a single error often leads to irrecoverable failure, particularly under strict feasibility constraints. We systematically analyze existing agent frameworks, identifying imperfect planning and stochastic execution as the primary causes. To address these challenges, we propose Tool-guided Adaptive Planning with constrained Execution (TAPE). TAPE enhances planning capability by aggregating multiple plans into a graph and employing an external solver to identify a feasible path. During execution, TAPE employs constrained decoding to reduce sampling noise, while adaptively re-planning whenever environmental feedback deviates from the intended state. Experiments across Sokoban, ALFWorld, MuSiQue, and GSM8K-Hard demonstrate that TAPE consistently outperforms existing frameworks, improving success rates by 21.0 percentage points on average in hard settings and by 20.0 percentage points on average for weaker base models. Code and data are available here.

Paper Structure

This paper contains 59 sections, 3 theorems, 27 equations, 7 figures, 4 tables, and 1 algorithm.

Key Result

Proposition 1

Assume that any successful trajectory requires at least $T$ action selections. Then the ReAct success probability is upper-bounded by $U_{\mathrm{ReAct}}$, with equality when the task ends in exactly $T$ steps. If $(1-\epsilon_{\mathrm{p}})\delta_{\mathrm{b}} \ge \epsilon_{\mathrm{p}}\delta_{\mathrm{r}}$, then $U_{\mathrm{ReAct}}$ increases as $\epsilon_{\mathrm{p}}$ and $\epsilon_{\mathrm{s}}$ decrease.
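To build intuition for why the success probability shrinks with the horizon $T$, the following is a deliberately simplified sketch. It is not the paper's bound $U_{\mathrm{ReAct}}$: it assumes that every step independently suffers a planning error with probability `eps_p` and a sampling error with probability `eps_s`, and that any single error is irrecoverable. The function name and the numeric values are illustrative assumptions.

```python
def toy_success_probability(eps_p: float, eps_s: float, T: int) -> float:
    """Probability of T consecutive error-free steps in the simplified model.

    Assumption (not from the paper): errors are independent across steps and
    every planning or sampling error makes the task fail permanently.
    """
    if not (0.0 <= eps_p <= 1.0 and 0.0 <= eps_s <= 1.0):
        raise ValueError("error rates must lie in [0, 1]")
    if T < 0:
        raise ValueError("T must be non-negative")
    per_step = (1.0 - eps_p) * (1.0 - eps_s)  # chance that one step is clean
    return per_step ** T


if __name__ == "__main__":
    # Illustrative error rates only; they are not measured values.
    for T in (5, 10, 20, 40):
        both = toy_success_probability(0.10, 0.05, T)
        no_sampling = toy_success_probability(0.10, 0.0, T)
        print(f"T={T:2d}  both errors: {both:.3f}  planning only: {no_sampling:.3f}")
```

Under these assumptions the success probability decays geometrically in $T$, and lowering either error rate raises it, which mirrors the qualitative statement of the proposition.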

Figures (7)

  • Figure 1: Overview. We illustrate our work using Sokoban, where the goal is to push all boxes onto target locations. (a) Sources of Irrecoverable Failure in the ReAct Framework. A planning error occurs when the internal reasoning suggests a non-viable action (e.g., pushing a box against a wall); this makes the goal unachievable as the agent cannot pull the box from the wall, while a sampling error arises when LM stochasticity leads to an action deviating from the plan. (b) Conceptual Toy Analysis. We model simplified agents by injecting planning and sampling errors into a feasible policy for Sokoban. We measure success rates as the task step $T$ increases, observing that existing frameworks degrade rapidly as $T$ grows. See the appendix for details. (c) Our Framework. TAPE generates and aggregates multiple plans into a graph and uses a solver to select a feasible path, thereby reducing planning errors. Then, it enforces constrained execution to suppress sampling errors.
  • Figure 2: Overview of TAPE. The proposed framework consists of four steps. (a) Plan Graph Construction: The LM samples multiple trajectories, which are aggregated into a plan graph with predicted costs and scores. (b) Plan Path Selection: An external solver (e.g., ILP) identifies the optimal path (blue arrows) subject to constraints. (c) Constrained Execution: The agent executes the selected actions using constrained decoding to eliminate sampling errors. (d) Mismatch Check: If a mismatch occurs between the predicted and realized state, the agent re-performs plan graph construction and path selection; otherwise, the agent executes the next planned action.
  • Figure 3: Success rates across four agentic tasks. We evaluate our framework against ReAct and Plan-and-Act on Sokoban, ALFWorld, MuSiQue, and GSM-Hard. We use gpt-4.1-mini as the LM backbone. We find that TAPE consistently demonstrates superior performance over existing frameworks in both easy and hard settings.
  • Figure 4: Cross-model success rates and sensitivity to the number of generated plans in Sokoban. (a) Success rates of TAPE and baselines across LMs with different planning capabilities. TAPE consistently improves over other frameworks, with larger gains on weaker models, indicating effective mitigation of planning errors. (b) Sensitivity of TAPE to the number of generated plans $M$ used to construct the plan graph. The best performance is achieved at $M=4$, suggesting that moderate plan aggregation via node merging effectively expands the candidate action space.
  • Figure 5: Impact of larger step budgets and success-cost trade-off on Sokoban. (a) Success rate as a function of the normalized step budget $B/S_{\min}$, where $S_{\min}$ denotes the oracle minimum number of steps (here $S_{\min}=6$ and $B\in\{8,10,14,22\}$). Error bars indicate $\pm$ standard error across runs. Our framework achieves a higher success rate as the step budget increases. (b) Success rate versus step consumption (steps used divided by the budget). Methods closer to the top-left achieve higher success while consuming fewer steps per budget, indicating a better success-cost trade-off. TAPE can be both more efficient and more effective than the other frameworks.
  • ...and 2 more figures
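
The four steps in the Figure 2 caption (plan graph construction, path selection, constrained execution, mismatch check) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `build_plan_graph`, `select_path`, and `run_episode` are hypothetical, a small exhaustive search stands in for the external solver (e.g., ILP), the objective "maximize total score within the cost budget" is an assumption, and constrained decoding is represented simply by emitting only the planned action.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One planned step: (state, action, predicted next state, predicted cost, score).
PlanStep = Tuple[str, str, str, float, float]
# One graph edge: (action, predicted next state, predicted cost, score).
Edge = Tuple[str, str, float, float]
# One selected step: (action, predicted next state, predicted cost).
PathStep = Tuple[str, str, float]


def build_plan_graph(plans: List[List[PlanStep]]) -> Dict[str, List[Edge]]:
    """Merge sampled plans into one graph; identical states share a node."""
    graph: Dict[str, List[Edge]] = {}
    for plan in plans:
        for state, action, nxt, cost, score in plan:
            edges = graph.setdefault(state, [])
            edge = (action, nxt, cost, score)
            if edge not in edges:  # drop duplicate edges from overlapping plans
                edges.append(edge)
            graph.setdefault(nxt, [])
    return graph


def select_path(graph: Dict[str, List[Edge]], start: str, goal: str,
                budget: float) -> Optional[List[PathStep]]:
    """Exhaustive search standing in for the solver: best-scoring path within budget."""
    best: Optional[List[PathStep]] = None
    best_score = float("-inf")

    def dfs(state: str, spent: float, score: float,
            path: List[PathStep], visited: frozenset) -> None:
        nonlocal best, best_score
        if state == goal:
            if score > best_score:
                best, best_score = list(path), score
            return
        for action, nxt, cost, edge_score in graph.get(state, []):
            if nxt in visited or spent + cost > budget:
                continue  # skip cycles and budget-violating edges
            path.append((action, nxt, cost))
            dfs(nxt, spent + cost, score + edge_score, path, visited | {nxt})
            path.pop()

    dfs(start, 0.0, 0.0, [], frozenset({start}))
    return best


def run_episode(start: str, goal: str, budget: float,
                propose_plans: Callable[[str], List[List[PlanStep]]],
                env_step: Callable[[str, str], str],
                max_replans: int = 3) -> bool:
    """Plan, execute only planned actions, and replan on a state mismatch."""
    state, spent = start, 0.0
    for _ in range(max_replans + 1):
        graph = build_plan_graph(propose_plans(state))
        path = select_path(graph, state, goal, budget - spent)
        if path is None:
            return False  # no feasible path in the current plan graph
        for action, predicted, cost in path:
            state = env_step(state, action)  # only the planned action is emitted
            spent += cost
            if state != predicted:
                break  # mismatch check failed: rebuild the graph and replan
        else:
            return state == goal
    return False
```

Because overlapping plans share nodes, the merged graph can contain paths that no single sampled plan proposed, which is the effect the Figure 4(b) caption attributes to node merging. In a real system `propose_plans` would query the LM and `env_step` would call the environment; both are placeholders here.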

Theorems & Definitions (8)

  • Definition 1
  • Definition 2: Success Probability
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • proof
  • proof
  • proof