Table of Contents
Fetching ...

Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents

Davide Paglieri, Bartłomiej Cupiał, Jonathan Cook, Ulyana Piterbarg, Jens Tuyls, Edward Grefenstette, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel

TL;DR

The paper introduces a dynamic planning framework for LLM agents that learns when to allocate test-time compute for planning, balancing guiding advantages against costs to avoid instability from excessive reasoning. A two-stage training pipeline—supervised fine-tuning (SFT) on diverse planning data followed by reinforcement learning (RL)—yields agents that plan strategically, execute plans, and replan when needed in long-horizon tasks. Empirical results in POGS and Crafter reveal a Goldilocks planning regime where intermediate planning frequency outperforms always- or never-planning, with SFT priming improving imitation learning and RL enabling robust, steerable planning behavior. The work demonstrates that dynamic planning can be learned, scaled, and guided by human plans, advancing safer, more efficient, and collaborative agentic systems.

Abstract

Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. In agentic settings, existing methods like ReAct prompt LLMs to explicitly plan before every action; however, we demonstrate that always planning is computationally expensive and degrades performance on long-horizon tasks, while never planning further limits performance. To address this, we introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute for planning. We propose a simple two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, and (2) RL to refine this capability in long-horizon environments. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives. Additionally, we demonstrate that these agents can be effectively steered by human-written plans, surpassing their independent capabilities. To our knowledge, this work is the first to explore training LLM agents for dynamic test-time compute allocation in sequential decision-making tasks, paving the way for more efficient, adaptive, and controllable agentic systems.

Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents

TL;DR

The paper introduces a dynamic planning framework for LLM agents that learns when to allocate test-time compute for planning, balancing guiding advantages against costs to avoid instability from excessive reasoning. A two-stage training pipeline—supervised fine-tuning (SFT) on diverse planning data followed by reinforcement learning (RL)—yields agents that plan strategically, execute plans, and replan when needed in long-horizon tasks. Empirical results in POGS and Crafter reveal a Goldilocks planning regime where intermediate planning frequency outperforms always- or never-planning, with SFT priming improving imitation learning and RL enabling robust, steerable planning behavior. The work demonstrates that dynamic planning can be learned, scaled, and guided by human plans, advancing safer, more efficient, and collaborative agentic systems.

Abstract

Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. In agentic settings, existing methods like ReAct prompt LLMs to explicitly plan before every action; however, we demonstrate that always planning is computationally expensive and degrades performance on long-horizon tasks, while never planning further limits performance. To address this, we introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute for planning. We propose a simple two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, and (2) RL to refine this capability in long-horizon environments. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives. Additionally, we demonstrate that these agents can be effectively steered by human-written plans, surpassing their independent capabilities. To our knowledge, this work is the first to explore training LLM agents for dynamic test-time compute allocation in sequential decision-making tasks, paving the way for more efficient, adaptive, and controllable agentic systems.

Paper Structure

This paper contains 36 sections, 5 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Dynamic planning strategies across environments and training stages. (a-b) Zero-shot results showing optimal "Goldilocks" planning frequency in Crafter and POGS (100 seeds, bars=standard-error). (c-d) SFT results demonstrating planning agents' improved performance with lower KL divergence from base model. (e-f) RL results where SFT-primed planning agents are more sample efficient than non-planning baselines and more consistently reach complex achievements.
  • Figure 2: Dynamic Planning Agent Architecture. Our agent is a single, monolithic LLM whose conceptual policies are realized through its unified output format. The decision to plan ($\phi_\theta$) is made implicitly by the model's choice to begin its generation with a <plan> token. This single output string is then parsed to extract the action ($a_t$) and, if present, the new plan ($p_t$), thereby executing the acting ($\pi_\theta$) and planning ($\psi_\theta$) policies.
  • Figure 3: Human-Agent collaboration in Crafter. We show an example where a human guides the agent with high-level plans to clear a cave from a skeleton, and create a shelter to survive the night, a complex behaviour that was not observed in any of the training runs otherwise.
  • Figure 4: An illustration of the agent's input context over two timesteps, $t$ and $t+1$. At each timestep, the agent processes a chat-formatted history composed of a system prompt, user messages (green observation and gray instruction), and assistant messages (yellow plan and blue action). The agent receives the history of interactions and in this case it generates a new plan and action. In the subsequent timestep $t+1$, the input history is updated: the plan and action generated at $t$ are appended to the interaction history together with new observation, and the previous plan is removed. During experiments, to manage context length, the history provided to the agent was truncated to a maximum of 16 observations.
  • Figure 5: POGS and Crafter environments. POGS (left): Agent navigates a procedurally generated graph with partial visibility. Crafter (right): Agent receives natural language descriptions of terrain, resources, and creatures with their relative positions.
  • ...and 10 more figures