Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents
Davide Paglieri, Bartłomiej Cupiał, Jonathan Cook, Ulyana Piterbarg, Jens Tuyls, Edward Grefenstette, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel
TL;DR
The paper introduces a dynamic planning framework for LLM agents that learns when to allocate test-time compute for planning, weighing the guidance a plan provides against its computational cost and the instability that excessive reasoning can cause. A two-stage training pipeline, supervised fine-tuning (SFT) on diverse planning data followed by reinforcement learning (RL), yields agents that plan strategically, execute their plans, and replan when needed in long-horizon tasks. Empirical results in POGS and Crafter reveal a "Goldilocks" planning regime in which an intermediate planning frequency outperforms both always planning and never planning, with SFT priming improving imitation learning and RL enabling robust, steerable planning behavior. The work demonstrates that dynamic planning can be learned, scaled, and guided by human plans, advancing safer, more efficient, and more collaborative agentic systems.
Abstract
Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. In agentic settings, existing methods such as ReAct prompt LLMs to explicitly plan before every action; however, we demonstrate that always planning is computationally expensive and degrades performance on long-horizon tasks, while never planning also limits performance. To address this, we introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute for planning. We propose a simple two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, and (2) RL to refine this capability in long-horizon environments. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives. Additionally, we demonstrate that these agents can be effectively steered by human-written plans, surpassing their independent capabilities. To our knowledge, this work is the first to explore training LLM agents for dynamic test-time compute allocation in sequential decision-making tasks, paving the way for more efficient, adaptive, and controllable agentic systems.
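The idea of deciding at each step whether to plan can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`run_episode`, `should_plan`, `make_plan`, `act`), the `StepRecord` type, and the toy rule "replan only when no plan exists or the observation signals a surprise" are all hypothetical stand-ins for what would be a single learned LLM policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepRecord:
    action: str
    planned: bool  # True if compute was spent on (re)planning at this step


def run_episode(
    observations: List[str],
    should_plan: Callable[[str, Optional[str]], bool],
    make_plan: Callable[[str], str],
    act: Callable[[str, Optional[str]], str],
) -> List[StepRecord]:
    """Hypothetical agent loop: plan only when the policy decides it is worthwhile."""
    plan: Optional[str] = None
    records: List[StepRecord] = []
    for obs in observations:
        planned = should_plan(obs, plan)
        if planned:
            plan = make_plan(obs)  # spend extra test-time compute here
        records.append(StepRecord(action=act(obs, plan), planned=planned))
    return records


# Toy stand-ins for the learned policy (assumptions for illustration only).
def toy_should_plan(obs: str, plan: Optional[str]) -> bool:
    return plan is None or obs.startswith("surprise")


def toy_make_plan(obs: str) -> str:
    return f"plan for {obs}"


def toy_act(obs: str, plan: Optional[str]) -> str:
    return f"act on {obs} using {plan}"


observations = ["start", "tree", "surprise: zombie", "stone"]
records = run_episode(observations, toy_should_plan, toy_make_plan, toy_act)
planning_steps = sum(r.planned for r in records)
```

In this toy run the agent plans on the first step, follows that plan on the second, replans when the surprising observation arrives, and then follows the new plan. Always planning would correspond to `should_plan` returning `True` on every step, and never planning to it returning `False` on every step.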
