A Subgoal-driven Framework for Improving Long-Horizon LLM Agents

Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette

Abstract

Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track as new information arrives, lacking a clear and adaptive path toward the final goal. This issue is further exacerbated during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we make two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism raises the success rate (SR) of proprietary models such as Gemini by approximately 10% absolute on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model increases its success rate from 6.4% to 43.0%. This performance surpasses proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent's long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.
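
To make the contrast between sparse and milestone-based rewards concrete, the following is a minimal sketch and not the MiRA implementation. It assumes milestones can be expressed as boolean checks on an observed state, that each milestone pays a fixed bonus the first time it is satisfied, and that the final success reward is added on the last transition. The function name `milestone_rewards`, the bonus weight, and the toy trajectory are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical types: a state is a plain dict, a milestone is a predicate on it.
State = Dict[str, object]
Milestone = Callable[[State], bool]


def milestone_rewards(
    states: Sequence[State],
    milestones: Sequence[Milestone],
    final_success: bool,
    milestone_bonus: float = 0.1,  # assumed weight, not taken from the paper
    success_reward: float = 1.0,
) -> List[float]:
    """Return one reward per transition (len(states) - 1 entries)."""
    done = [False] * len(milestones)
    rewards: List[float] = []
    for t in range(1, len(states)):
        r = 0.0
        for k, check in enumerate(milestones):
            # Pay each milestone only once, when it is first satisfied.
            if not done[k] and check(states[t]):
                done[k] = True
                r += milestone_bonus
        rewards.append(r)
    # The sparse task reward arrives only on the last transition.
    if rewards and final_success:
        rewards[-1] += success_reward
    return rewards


# Toy trajectory (invented for illustration): search for an item, then add it to the cart.
trajectory = [
    {"url": "/home"},
    {"url": "/search", "query": "shoes"},
    {"url": "/search", "query": "shoes"},
    {"url": "/cart", "items": 1},
]
milestones = [
    lambda s: s.get("query") == "shoes",
    lambda s: s.get("items", 0) >= 1,
]

dense = milestone_rewards(trajectory, milestones, final_success=True)
sparse = milestone_rewards(trajectory, [], final_success=True)
print("dense :", dense)
print("sparse:", sparse)
```

With an empty milestone list the function reduces to the sparse setting, where every transition except the last receives zero reward. With milestones, intermediate transitions that complete a subgoal receive a signal, which is the credit-assignment benefit the abstract attributes to dense rewards.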

Paper Structure (78 sections, 39 equations, 20 figures, 7 tables, 3 algorithms)

Figures (20)

  • Figure 1: Overview of Milestoning the Agents
  • Figure 2: Automated Identification of Key Decision Steps. The analyzer compares the agent's trajectory against task intent and known successful patterns to pinpoint the exact step (e.g., Step 3) where a critical deviation occurred.
  • Figure 3: Overview of the failure-mode distribution of existing out-of-the-box models
  • Figure 4: Validation of graded subgoal agreement. (Left) The ROC curve, using the fraction of completed subgoals $s_i$ as a score, yields a high AUROC of 0.84. (Right) The Monotonicity Calibration plot shows a strictly increasing probability of success $P(y=1 \mid m)$ as more subgoals $m$ are completed. Both plots confirm the subgoals are a reliable progress signal.
  • Figure 5: Dynamic Milestoning Framework for Enhanced LLM Agent Inference. The architecture depicts the real-time feedback loop where the online agent's actions are monitored against a SubGoals Checklist. The reasoning model itself uses trace reflection to determine progress ($\mathbf{z}_{t+1}$), providing a dense, grounded signal that directs the agent's next planning step and enables self-correction.
  • ...and 15 more figures
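
The two checks named in the Figure 4 caption can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' evaluation code: the AUROC is computed as the pairwise probability that a successful trajectory has a higher subgoal-completion fraction than a failed one (ties count one half), and monotonicity is checked on the empirical success rate per number of completed subgoals. The data and the helper names `auroc`, `success_rate_by_count`, and `is_strictly_increasing` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random success outscores a random failure (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def success_rate_by_count(
    completed: Sequence[int], labels: Sequence[int]
) -> Dict[int, float]:
    """Empirical P(y = 1 | m) for each observed number m of completed subgoals."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for m, y in zip(completed, labels):
        totals[m] += 1
        hits[m] += y
    return {m: hits[m] / totals[m] for m in sorted(totals)}


def is_strictly_increasing(values: List[float]) -> bool:
    return all(a < b for a, b in zip(values, values[1:]))


# Invented toy data: 10 trajectories, 3 subgoals each; label 1 means task success.
num_subgoals = 3
completed = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
fractions = [m / num_subgoals for m in completed]

print("AUROC:", auroc(fractions, labels))
rates = success_rate_by_count(completed, labels)
print("P(success | m):", rates)
print("monotone:", is_strictly_increasing(list(rates.values())))
```

The pairwise form is quadratic in the number of trajectories, which is acceptable for a small validation set; a rank-based implementation would be preferable at scale.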

Theorems & Definitions (2)

  • Claim 1: Wiewiora, 2003, adapted for goal-conditioning
  • Proof: Proof by Induction