Expanding LLM Agent Boundaries with Strategy-Guided Exploration

Andrew Szot, Michael Kirchhof, Omar Attia, Alexander Toshev

TL;DR

This work proposes Strategy-Guided Exploration (SGE), which first generates a concise natural-language strategy that describes what to do to make progress toward the goal and then generates environment actions conditioned on that strategy. SGE enables the agent to learn to solve tasks too difficult for the base model.

Abstract

Reinforcement learning (RL) has demonstrated notable success in post-training large language models (LLMs) as agents for tasks such as computer use, tool calling, and coding. However, exploration remains a central challenge in RL for LLM agents, especially as they operate in language-action spaces with complex observations and sparse outcome rewards. In this work, we address exploration for LLM agents by leveraging the ability of LLMs to plan and reason in language about the environment to shift exploration from low-level actions to higher-level language strategies. We thus propose Strategy-Guided Exploration (SGE), which first generates a concise natural-language strategy that describes what to do to make progress toward the goal, and then generates environment actions conditioned on that strategy. By exploring in the space of strategies rather than the space of actions, SGE induces structured and diverse exploration that targets different environment outcomes. To increase strategy diversity during RL, SGE introduces mixed-temperature sampling, which explores diverse strategies in parallel, along with a strategy reflection process that grounds strategy generation on the outcomes of previous strategies in the environment. Across UI interaction, tool-calling, coding, and embodied agent environments, SGE consistently outperforms exploration-focused RL baselines, improving both learning efficiency and final performance. We show that SGE enables the agent to learn to solve tasks too difficult for the base model.
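
The two-stage generation with mixed-temperature sampling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the temperatures (1.2 and 0.7), the end-of-strategy delimiter token, and the `next_logits` callback are all hypothetical choices made for illustration. The only idea taken from the text is that strategy tokens are sampled at a higher temperature than the tokens that follow.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Sample a token index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def mixed_temperature_sample(next_logits, end_strategy_id, eos_id,
                             strategy_temp=1.2, action_temp=0.7,
                             max_tokens=64, seed=0):
    """Generate one response with two sampling temperatures.

    Tokens up to and including the (hypothetical) end-of-strategy
    delimiter are sampled at strategy_temp; all later tokens are
    sampled at action_temp. `next_logits(prefix)` is an assumed
    callback returning a list of logits for the next token.
    Returns the sampled tokens and the temperature used for each.
    """
    rng = random.Random(seed)
    tokens, temps = [], []
    in_strategy = True
    for _ in range(max_tokens):
        temp = strategy_temp if in_strategy else action_temp
        token = sample_with_temperature(next_logits(tokens), temp, rng)
        tokens.append(token)
        temps.append(temp)
        if token == eos_id:
            break
        if token == end_strategy_id:
            in_strategy = False  # switch to the action phase
    return tokens, temps
```

In this sketch, several rollouts sampled in parallel for the same task would differ mainly in their strategy text, while the lower action temperature keeps the action tokens close to what the chosen strategy implies.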

Paper Structure (23 sections, 2 equations, 11 figures, 2 tables)

This paper contains 23 sections, 2 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Overview of Strategy-Guided Exploration (SGE). SGE improves reinforcement learning (RL) training on hard agentic tasks for which the base model fails, even after many attempts (left of figure). SGE addresses this by having the LLM policy output a language "strategy" and then conditioning the action generation on this strategy. The reasoning capabilities of the LLM enable it to output diverse strategies through the techniques of: (1) mixed-temperature sampling, where strategy tokens are sampled with higher temperature than remaining tokens, and (2) strategy reflection, where strategies are generated to be distinct from other strategies executed earlier in RL training. This enables SGE to explore to solve hard tasks that the base model is not capable of succeeding in (right of figure).
  • Figure 2: RL training curves for SGE and baselines in each environment. Training curves show the average pass@1 rate on the training tasks versus the number of RL updates, averaged across 3 random seeds, with the shaded area representing the standard deviation of the pass@1 rate across the seeds. The horizontal lines indicate the base model pass@$k$ rate, with lighter shades meaning a higher $k$. In the Coding environment, SGE and GRPO are trained longer than the other approaches to show better converged performance. The RND and RLAD baselines are shown only for this environment because they underperform GRPO and EntropyAdv.
  • Figure 3: Comparing the pass@$k$ of the final policies trained by SGE and GRPO and the starting base model across the environments on the training tasks. We compute the pass@$k$ for $k$ in powers of 2, and the x-axis is on a logarithmic scale. The "Base" line shows that the pass@$k$ of the base model eventually plateaus for large enough $k$. GRPO raises the "Base" curve, enabling the pass@1 to closely match the pass@$k$ for the highest $k$. On the other hand, SGE exceeds this pass@$k$ ceiling of the base model and enables solving new problems that are unsolvable by the base model. We report averages across all tasks in the training set.
  • Figure 4: Left: we ablate the parallel strategy sampling by comparing against sampling all tokens at a single temperature, either the high temperature used to sample the strategy tokens or the LLM default temperature. Middle: we ablate the sequential strategy reflection by removing SGE's ability to reflect on successful strategies, negative strategies, or any strategy. Right: effect of scaling the Qwen3 base model parameter count on RL training with and without SGE. Ablation results are over 3 random seeds, and scaling results are over 1 random seed.
  • Figure 5: Left: Comparing the effect of different mixed-temperature sampling settings on the zero-shot pass@16 performance in the Coding environment. The high values in the top left indicate that mixed-temperature sampling produces the best results. Right: Visualizing actions predicted by SGE vs. regular sampling in the MarkorCreateNote task in AndroidWorld. 16 actions are sampled for the same observation screenshot, with the tap locations visualized as colored circles overlaid on the screen (non-tap actions are not visualized). To complete the task, the agent must tap the file extension dropdown. Unlike regular sampling, SGE is able to sample this correct action.
  • ...and 6 more figures
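
The captions for Figures 2, 3, and 5 report pass@$k$ rates. The captions do not say how pass@$k$ is computed, so the sketch below is an assumption: it uses the unbiased estimator that is common in code-generation evaluation, where $n$ attempts are sampled per task, $c$ of them succeed, and pass@$k$ is estimated as $1 - \binom{n-c}{k} / \binom{n}{k}$. The function name and argument names are illustrative only.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k attempts succeeds.

    n: total attempts sampled for the task
    c: number of those attempts that succeeded
    k: attempt budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no success;
    # it is 0 whenever fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this estimator, a task with zero successes out of $n$ attempts contributes 0 for every $k$. This matches the plateau of the "Base" curve described in Figure 3: a larger $k$ cannot help on tasks the base model never solves.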