Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space

Joey Hong; Kang Liu; Zhan Ling; Jiecao Chen; Sergey Levine

TL;DR

The paper addresses the unstable, sample-inefficient training of long-horizon LLM agents by introducing Natural Language Actor-Critic (NLAC), which replaces the scalar value function with a natural language critic that reasons about future outcomes. The critic generates textual evaluations and explanations of actions. It is trained off-policy with a language Bellman backup, and the policy is improved by a refinement step that uses in-context reasoning over the critic's output. The authors relate the method to successor features and prove convergence under stated assumptions. Experiments on reasoning, web-browsing, and tool-use-with-dialogue tasks show NLAC often outperforming RL fine-tuning and prompting baselines. Because the policy can reason in language about how to improve an action, the approach relies less on random exploration and is more data-efficient. Suggested future work includes combining language critiques with scalar value signals and mitigating catastrophic forgetting.

Abstract

Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.
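
The data flow described above, a critic that returns text, an off-policy backup that builds critic targets from stored transitions, and a policy improved by refining actions against critiques, can be sketched as follows. This is only an illustrative sketch under loud assumptions: the `Transition` fields, the `Policy`/`Critic`/`Refiner` callables, the target string format, and the toy stubs are all hypothetical, and the paper's actual prompts, backup operator, and training losses are not given in this text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Transition:
    """One stored environment step (hypothetical replay-buffer record)."""
    state: str
    action: str
    reward: float
    next_state: str
    done: bool


# Hypothetical interfaces; in practice each would be an LLM call.
Policy = Callable[[str], str]             # state -> action
Critic = Callable[[str, str], str]        # (state, action) -> textual critique
Refiner = Callable[[str, str, str], str]  # (state, action, critique) -> improved action


def language_backup_target(t: Transition, policy: Policy, critic: Critic) -> str:
    """Build a textual target for critic(t.state, t.action) from one transition.

    Assumed analogue of a one-step Bellman backup: combine the observed
    reward with the critic's text about the policy's next action.
    """
    if t.done:
        return f"Episode ended with reward {t.reward}."
    next_action = policy(t.next_state)
    future = critic(t.next_state, next_action)
    return f"Observed reward {t.reward}; then: {future}"


def refine_action(state: str, policy: Policy, critic: Critic, refiner: Refiner) -> str:
    """Propose an action, critique it in language, and return a refined action."""
    action = policy(state)
    critique = critic(state, action)
    return refiner(state, action, critique)


def build_training_pairs(
    buffer: List[Transition], policy: Policy, critic: Critic, refiner: Refiner
) -> Tuple[List[Tuple[Tuple[str, str], str]], List[Tuple[str, str]]]:
    """Turn stored transitions into supervised targets for critic and policy."""
    critic_data = [((t.state, t.action), language_backup_target(t, policy, critic))
                   for t in buffer]
    policy_data = [(t.state, refine_action(t.state, policy, critic, refiner))
                   for t in buffer]
    return critic_data, policy_data


# Toy deterministic stubs standing in for LLM calls (assumptions for illustration).
def toy_policy(state: str) -> str:
    return "guess"


def toy_critic(state: str, action: str) -> str:
    return f"'{action}' in '{state}' is likely to fail; try 'search'."


def toy_refiner(state: str, action: str, critique: str) -> str:
    return "search" if "try 'search'" in critique else action


buffer = [
    Transition("s0", "guess", 0.0, "s1", False),
    Transition("s1", "search", 1.0, "end", True),
]
critic_data, policy_data = build_training_pairs(buffer, toy_policy, toy_critic, toy_refiner)
```

In this sketch the buffer can hold transitions from any earlier policy, which is what makes the update off-policy, and neither target requires a policy gradient: both models would be fit to text targets by ordinary supervised fine-tuning, which is itself an assumption here.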
