Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space

Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, Sergey Levine

TL;DR

The paper tackles the training inefficiencies of long-horizon LLM agents by introducing Natural Language Actor-Critic (NLAC), which replaces scalar value functions with a natural language critic that reasons about future outcomes. The critic generates textual evaluations and explanations, enabling off-policy training via a language Bellman backup and a refinement-based policy improvement that leverages in-context reasoning. The authors provide theoretical connections to successor features and prove convergence under assumptions, and demonstrate strong empirical gains across reasoning, web-browsing, and dialogue tasks, often outperforming traditional RL fine-tuning and prompting baselines. NLAC offers a scalable, data-efficient framework that harnesses LLM capabilities to reason about action improvements in language space, reducing reliance on random exploration. Potential future work includes integrating scalar value signals and developing techniques that mitigate catastrophic forgetting while maintaining the benefits of language-based critiques.

Abstract

Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.
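The policy-improvement idea described in the abstract (a language critique guiding a revised action, which then becomes a training target) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`build_distillation_set`, `toy_critic`, `toy_refiner`, `DistillationExample`) is hypothetical, the critic and refiner are hard-coded stand-ins for LLM calls, and the training of the critic itself is omitted entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type aliases; each callable stands in for an LLM call.
Critic = Callable[[str, str], str]        # (state, action) -> textual critique
Refiner = Callable[[str, str, str], str]  # (state, action, critique) -> revised action


@dataclass
class DistillationExample:
    """A (state, target action) pair a policy could be fine-tuned on."""
    state: str
    target_action: str


def build_distillation_set(
    replay: List[Tuple[str, str]], critic: Critic, refiner: Refiner
) -> List[DistillationExample]:
    """Critique each logged action in language, then refine it in context.

    The logged pairs may come from any earlier policy, which is what makes
    the procedure usable off-policy in this sketch.
    """
    examples = []
    for state, action in replay:
        critique = critic(state, action)
        refined = refiner(state, action, critique)
        examples.append(DistillationExample(state, refined))
    return examples


def toy_critic(state: str, action: str) -> str:
    # Assumption: a rule-based stand-in for a generative LLM critic.
    if "color" in action:
        return "Color rarely separates the remaining candidates; ask about taste or size."
    return "This question splits the remaining candidates well."


def toy_refiner(state: str, action: str, critique: str) -> str:
    # Assumption: a rule-based stand-in for in-context refinement by an LLM.
    return "Is it sweet?" if "ask about taste" in critique else action


replay = [
    ("fruit, not red, found in salads", "Is the color green?"),
    ("fruit, not red, found in salads", "Is it smaller than a grape?"),
]
dataset = build_distillation_set(replay, toy_critic, toy_refiner)
for example in dataset:
    print(example.state, "->", example.target_action)
```

In a real system the resulting pairs would serve as supervised fine-tuning targets for the policy, so no policy gradient is needed for this step.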

Paper Structure

This paper contains 24 sections, 2 theorems, 17 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Theorem 5.1

Consider policy evaluation via Equation eq:successor_train and let $Q_L^\pi$ be the natural language critic at convergence. For any state $s$ and action $a$, there exists a monotonic mapping $g$ such that $Q^\pi(s, a) = g(Q_L^\pi(s, a))$, where $Q^\pi$ denotes the true scalar Q-function.

Figures (5)

  • Figure 1: Overview of NLAC. During policy evaluation, the critic is trained using a language Bellman backup that operates in textual space. During policy improvement, the policy is distilled from a refinement policy.
  • Figure 2: Sample timestep on 20Q, where the LLM agent attempts to guess the hidden object "raisin." The base LLM agent has narrowed down the object to a non-red fruit found in salads, but proceeds to search over color. However, color is often not the most defining characteristic, so it is better to search over other discriminators such as taste or size.
  • Figure 3: Sample timestep on $\tau$-bench where a base LLM agent fails by modifying the database (which can only be done once according to the guidelines) when more exchanges are likely needed. The natural language critic correctly identifies why the action is suboptimal, and explains it in language so that the same LLM can process the critique and correct its action.
  • Figure 4: Learning curves for NLAC and PPO across three independent runs. NLAC converges in fewer samples.
  • Figure : Natural Language Actor-Critic (NLAC)

Theorems & Definitions (5)

  • Definition 4.1
  • Definition 4.2
  • Definition 4.3
  • Theorem 5.1
  • Theorem 5.2