Natural Language Reinforcement Learning

Xidong Feng, Bo Liu, Yan Song, Haotian Fu, Ziyu Wan, Girish A. Koushik, Zhiyuan Hu, Mengyue Yang, Ying Wen, Jun Wang

TL;DR

NLRL reframes reinforcement learning in natural language by substituting scalar value functions with Language Value Functions (LVFs) that are interpretable narrations generated by LLMs. It extends core RL constructs to language equivalents (policy, Bellman updates, policy iteration) and introduces Language GPI, language TD, and a full language-based actor-critic loop to enable active, deliberative learning from experience. Across four multi-step tasks, including tic-tac-toe, FrozenLake, and Breakthrough, the approach demonstrates improved learning efficiency, robustness to stochastic dynamics, and the ability to elicit richer reasoning than traditional RL or prompting-augmented baselines. The work suggests a path toward more capable, introspective agents and highlights both practical benefits and limitations of relying on language-grounded value estimates for RL.

Abstract

Artificial intelligence is progressing toward the "Era of Experience," where agents are expected to learn from continuous, grounded interaction. We argue that traditional Reinforcement Learning (RL), which typically represents value as a scalar, can restrict an agent's deep understanding of environments and hinder the active, deliberative learning crucial for navigating this new paradigm. To address this issue, we introduce Natural Language Reinforcement Learning (NLRL), a framework that extends RL principles into natural language counterparts. Central to NLRL is the Language Value Function (LVF), which redefines value as an interpretable linguistic narrative articulating the rationale behind an evaluation. NLRL further extends this concept to core RL components, including policy, the Bellman equation, and policy iteration. Leveraging recent advancements in Large Language Models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value training through unsupervised environment interactions. Experiments over 4 multi-step agentic tasks demonstrate NLRL's effectiveness, efficiency, and its potential to foster deeper understanding and more active learning strategies.
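
For orientation, the scalar value that the abstract contrasts against satisfies the standard Bellman expectation equation, shown first below. The second line is only a schematic of how a language counterpart could be written. The operators $d$, $G_1$, and $G_2$ are illustrative names assumed here and may differ from the paper's own notation.

```latex
% Standard scalar Bellman expectation equation for a policy \pi
V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}
  \big[\, r(s_t, a_t) + \gamma\, V^{\pi}(s_{t+1}) \,\big]

% Schematic language analogue (illustrative notation, an assumption):
%   d   : describes a transition (a_t, r_t, s_{t+1}) in text
%   G_2 : merges that description with the successor's language evaluation
%   G_1 : aggregates the merged narratives over sampled transitions
V^{L}_{\pi}(s_t) = G_1\Big( \big\{\, G_2\big( d(a_t, r_t, s_{t+1}),\; V^{L}_{\pi}(s_{t+1}) \big) \,\big\} \Big)
```

In this reading, the sum inside the expectation becomes a text-merging step and the expectation itself becomes a text-aggregation step, both of which an LLM could carry out.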

Paper Structure

This paper contains 60 sections, 9 equations, 13 figures, 10 tables, 3 algorithms.

Figures (13)

  • Figure 1: Comparison between NLRL and traditional RL in agentic LLM
  • Figure 2: Comparing post-training reasoning on FrozenLake between RAGEN (Wang et al., 2025) and NLRL (ours). Traditional RL results in a meaningless CoT, while NLRL keeps an informative CoT and active reasoning.
  • Figure 3: Practical pipeline for implementing NLRL in the Tic-tac-toe game. LLMs can serve as the language policy ①, the language-based value function approximator ②, the language Monte Carlo or Temporal Difference operator ③, and the language policy improvement operator ⑤. By distilling (④, ⑥) the improved evaluations from ② and the enhanced actions from ⑤, the NLRL agent can iteratively refine its language policy and evaluation capabilities.
  • Figure 4: Breakthrough experiment results. (a) Performance comparison with baselines. (b,d) Ablation study over look-ahead step number and variation number. (c) Results for state scaling law.
  • Figure 5: Natural Language Actor Critic Pipeline training results. (a) Training results against the Random-Move Opponent. (b) Ablation study on components ($K_{MC}$, $K_{buffer}$, and Action Selection Mask). These results demonstrate that our proposed Natural Language Actor Critic pipeline can stably improve under stochastic dynamics. (c) - (e) Ablation studies on number of training epochs, Monte Carlo sample size $K_{MC}$, and number of rollout trajectories.
  • ...and 8 more figures
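
The roles named in the Figure 3 caption (language policy, language value function, language TD operator, policy improvement operator, and distillation) can be sketched as a small loop. The following is a minimal toy sketch, not the paper's implementation: every LLM call is replaced by a rule-based string function, the environment is a hypothetical four-cell chain rather than Tic-tac-toe, and all names (`step`, `language_td`, `rank`, `improve`, `train`) are assumptions made for illustration.

```python
from typing import Dict, Tuple

GOAL = 3                      # hypothetical toy chain with positions 0..3
ACTIONS = ("left", "right")


def step(state: int, action: str) -> int:
    """Deterministic toy dynamics: move one cell, clipped to [0, GOAL]."""
    delta = 1 if action == "right" else -1
    return max(0, min(GOAL, state + delta))


def language_td(state: int, action: str, values: Dict[int, str]) -> str:
    """Stand-in for a language TD operator: narrate one look-ahead step
    and append the stored language evaluation of the successor state."""
    nxt = step(state, action)
    return f"moving {action} from {state} leads to {nxt}, which is {values[nxt]}"


def rank(evaluation: str) -> Tuple[bool, int]:
    """Crude stand-in for an LLM comparing narratives: prefer evaluations
    that mention a good outcome, then the shorter (more direct) one."""
    return ("good" in evaluation, -len(evaluation))


def improve(state: int, values: Dict[int, str]) -> Tuple[str, str]:
    """Stand-in for language policy improvement: evaluate every action in
    text and return the preferred action with its evaluation."""
    evaluations = {a: language_td(state, a, values) for a in ACTIONS}
    best = max(ACTIONS, key=lambda a: rank(evaluations[a]))
    return best, evaluations[best]


def train(sweeps: int = 5) -> Tuple[Dict[int, str], Dict[int, str]]:
    """Alternate evaluation and improvement; 'distillation' is reduced here
    to overwriting dictionary entries instead of fine-tuning a model."""
    values = {s: "unknown" for s in range(GOAL)}
    values[GOAL] = "good: the goal is reached"
    policy = {s: ACTIONS[0] for s in range(GOAL)}
    for _ in range(sweeps):
        for s in range(GOAL):
            policy[s], values[s] = improve(s, values)
    return policy, values


policy, values = train()
print(policy)
print(values[0])
```

After a few sweeps, the narrative stored for the start state traces a path to the goal, which mirrors how a scalar TD update propagates value backward, except that the propagated quantity is text. In an actual NLRL setup, the string functions would be LLM prompts and the dictionary updates would be replaced by distillation into the model.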