Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

Joey Hong, Anca Dragan, Sergey Levine

TL;DR

This paper tackles long-horizon planning in goal-directed LLM interactions by introducing PNLC, which trains a lightweight goal-conditioned value function offline to act as a natural language critic at inference. By evaluating high-level thoughts rather than low-level actions and avoiding inference-time search, PNLC achieves efficient, scalable planning for frontier LLMs. Across web shopping, social deduction, and persuasion tasks, PNLC outperforms RL fine-tuning and prompting-based methods, with lower compute costs. The work demonstrates a practical path to integrating offline value estimation with opinionated, multi-turn reasoning in LLM agents.

Abstract

Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such a manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents and that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, yielding a concise and lightweight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.
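
One common way to read "a value function that predicts how a task will unfold" is as an estimate of how likely a given outcome is to be reached. The block below is a sketch under that assumption, not the paper's own equation: the symbols s, t, g and Q are introduced here purely for illustration.

```latex
% Assumed notation (not from the source):
%   s = interaction history so far
%   t = a candidate reasoning step ("thought") proposed by the LLM agent
%   g = a possible future outcome (goal), positive or negative
Q(s, t, g) \;\approx\; \Pr\big(\text{outcome } g \text{ is eventually reached} \,\big|\, s,\, t\big),
\qquad Q(s, t, g) \in [0, 1].
```

Under this reading, scoring one thought against several candidate outcomes gives the agent a compact picture of what is likely to go right or wrong before it commits to an action.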

Paper Structure

This paper contains 31 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: In an ongoing goal-oriented dialogue, we learn a natural language value over the internal reasoning steps of an LLM agent. The value analyzes future positive and negative outcomes and allows the LLM agent to refine its reasoning.
  • Figure 2: During offline training, training samples are summarized and embedded, and a goal-conditioned value function (which is just a small MLP) is trained over the embeddings.
  • Figure 3: During inference, a natural language critic uses the value function to produce an informative analysis of possible future outcomes. This natural language value is used by the LLM agent to refine its proposed reasoning.
  • Figure 4: The various tasks we consider, spanning tool use, games, and goal-oriented dialogue.
  • Figure 5: Example planning steps by our method. Left: In the social deduction task (Avalon), the agent originally intended to reveal the role of Player 2, which would have raised suspicion. After refinement, the agent keeps the information hidden. Right: In the persuasion task, the agent learns to address skepticism, better sharing how the charity takes accountability rather than focusing on their existing accomplishments.
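
The pipeline in Figures 2 and 3 (embed summaries, fit a small MLP, then score possible outcomes at inference) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: random vectors stand in for the embeddings of summarized states, thoughts, and goals; the labels are synthetic; and the MLP is fit with plain supervised binary cross-entropy rather than the offline goal-conditioned RL objective the paper uses. The outcome names and all function names are hypothetical.

```python
import numpy as np

def init_mlp(in_dim, hidden_dim, rng):
    """Randomly initialise a one-hidden-layer MLP (the 'small MLP' stand-in)."""
    return {
        "W1": rng.normal(0.0, 0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, size=(hidden_dim, 1)),
        "b2": np.zeros(1),
    }

def forward(params, x):
    """Return hidden activations and predicted goal-reaching probabilities."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    logits = (h @ params["W2"] + params["b2"])[:, 0]
    return h, 1.0 / (1.0 + np.exp(-logits))

def bce(p, y):
    """Mean binary cross-entropy between predictions p and 0/1 labels y."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def train_step(params, x, y, lr):
    """One full-batch gradient step on binary cross-entropy (simplified objective)."""
    h, p = forward(params, x)
    d_logits = (p - y)[:, None] / x.shape[0]
    d_h = (d_logits @ params["W2"].T) * (1.0 - h ** 2)  # uses W2 before its update
    params["W2"] -= lr * (h.T @ d_logits)
    params["b2"] -= lr * d_logits.sum(axis=0)
    params["W1"] -= lr * (x.T @ d_h)
    params["b1"] -= lr * d_h.sum(axis=0)

# --- Offline phase: synthetic stand-ins for embedded (state, thought, goal) samples ---
rng = np.random.default_rng(0)
emb_dim, n_samples = 8, 256
states = rng.normal(size=(n_samples, emb_dim))
thoughts = rng.normal(size=(n_samples, emb_dim))
goals = rng.normal(size=(n_samples, emb_dim))
x = np.concatenate([states, thoughts, goals], axis=1)
# Synthetic label: 1 if the goal was "reached"; here an arbitrary linear rule.
y = (x @ rng.normal(size=3 * emb_dim) > 0).astype(float)

params = init_mlp(3 * emb_dim, 32, rng)
initial_loss = bce(forward(params, x)[1], y)
for _ in range(500):
    train_step(params, x, y, lr=0.2)
final_loss = bce(forward(params, x)[1], y)

# --- Inference phase: score one proposed thought against several candidate outcomes ---
outcome_names = ["outcome A", "outcome B", "outcome C"]  # hypothetical labels
candidates = np.stack(
    [np.concatenate([states[0], thoughts[0], g]) for g in goals[:3]]
)
scores = forward(params, candidates)[1]
critique = [
    f"{name}: estimated likelihood {score:.2f}"
    for name, score in zip(outcome_names, scores)
]
```

In a full system, the `critique` lines would be turned into a natural language assessment and returned to the LLM agent so it can revise its proposed thought. Because only a small network is evaluated at inference, no rollouts or tree search are needed to produce these estimates.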