Natural Language Reinforcement Learning

Xidong Feng; Ziyu Wan; Mengyue Yang; Ziyan Wang; Girish A. Koushik; Yali Du; Ying Wen; Jun Wang

TL;DR

This work introduces Natural Language Reinforcement Learning (NLRL), a framework that maps traditional RL concepts onto natural language representations and uses large language models (LLMs) to perform policy evaluation, improvement, and value estimation in language space. It defines a text-based MDP with a language task instruction $T_L$, language descriptors $D_L$, and language value functions $V^L_\pi$, $Q^L_\pi$, along with a language Bellman equation, enabling language-driven generalized policy iteration. LLMs serve as the policy, value function approximator, information aggregator, and policy-improvement operator, with chain-of-thought reasoning and concept-based abstractions used to enhance interpretability. Demonstrations on tabular MDPs (text GridWorld and Frozen Lake) show that NLRL can achieve interpretable, information-rich reasoning and competitive performance, though hallucinatory outputs and scalability remain key challenges. Overall, the approach highlights a promising direction for interpretable, language-based RL and suggests that language can provide dense supervisory signals and rich prior knowledge for learning.

Abstract

Reinforcement Learning (RL) has shown remarkable abilities in learning policies for decision-making tasks. However, RL is often hindered by issues such as low sample efficiency, lack of interpretability, and sparse supervision signals. To tackle these limitations, we take inspiration from the human learning process and introduce Natural Language Reinforcement Learning (NLRL), which innovatively combines RL principles with natural language representation. Specifically, NLRL redefines RL concepts such as the task objective, policy, value function, Bellman equation, and policy iteration in natural language space. We show how NLRL can be practically implemented with the latest advancements in large language models (LLMs) like GPT-4. Initial experiments on tabular MDPs demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework.
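
To make the idea of a value function and Bellman update "in natural language space" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a tiny deterministic chain MDP, represents each state's language value as a plain string, and replaces the LLM aggregator with a trivial string join. Every name here (`describe`, `toy_aggregate`, `language_bellman_update`, the state labels) is hypothetical and introduced only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical and chosen for illustration only.
Transitions = Dict[str, List[Tuple[str, str]]]  # state -> [(action, next_state)]


def describe(state: str, action: str, next_state: str) -> str:
    """Language descriptor of a single transition."""
    return f"From {state}, '{action}' leads to {next_state}."


def toy_aggregate(texts: List[str]) -> str:
    """Stand-in for an LLM aggregator: join distinct statements in order."""
    return " | ".join(dict.fromkeys(texts))


def language_bellman_update(
    values: Dict[str, str],
    transitions: Transitions,
    aggregate: Callable[[List[str]], str] = toy_aggregate,
) -> Dict[str, str]:
    """One synchronous sweep: rewrite each state's text from its successors' text."""
    new_values = dict(values)  # terminal states keep their description
    for state, outcomes in transitions.items():
        lookahead = [
            describe(state, action, nxt) + " " + values[nxt]
            for action, nxt in outcomes
        ]
        new_values[state] = aggregate(lookahead)
    return new_values


# A three-state chain: s0 -> s1 -> goal, under a fixed policy that moves right.
transitions: Transitions = {"s0": [("right", "s1")], "s1": [("right", "goal")]}
v0 = {"s0": "No evaluation yet.", "s1": "No evaluation yet.", "goal": "This is the goal."}
v1 = language_bellman_update(v0, transitions)
v2 = language_bellman_update(v1, transitions)
print(v2["s0"])
```

After one sweep only `s1` mentions the goal; after two sweeps the description of `s0` does as well, which mirrors how value information propagates backwards in ordinary policy evaluation. In an LLM-backed variant, `toy_aggregate` would be replaced by a model call that summarises the look-ahead texts; that call is omitted here because the prompts are not given in this summary.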

Paper Structure (17 sections, 10 equations, 5 figures, 2 tables)

Introduction
Preliminary of Reinforcement Learning
Natural Language Reinforcement Learning
Definitions
Language Generalized Policy Iteration
Language Policy Evaluation
Language Policy Improvement
Practical Implementation with large language models
Discussions over other RL concepts
Experiments
Warm-Up: Language Policy Evaluation in text GridWorld
Language Policy evaluation and improvement in Stochastic Environment
Related work
Conclusion and Limitations
Experimental details
...and 2 more sections

Figures (5)

Figure 1: An illustrative grid-world MDP showing how NLRL and traditional RL differ in their task objective, value function, Bellman equation, and generalized policy iteration. In this grid-world, the robot needs to reach the crown and avoid all dangers. We assume the robot's policy takes the optimal action at each non-terminal state, except at state b, where it follows a uniformly random policy.
Figure 2: How the language evaluation of state $g$ at (0,3) evolves across iterations. Iter 0: initial descriptions. Iter 1: intermediate changes rule out the actions "go up" and "move right"; however, the optimal move cannot be determined without next-state evaluations. Iter 3: two optimal actions are identified through information transmitted from the goal. Iter 4: the evaluation of state $g$ converges.
Figure 3: Information flows from the goal to every state across iterations. Yellow cells mark the grid cells that the goal information has reached. Blue arrows denote the direction of information transmission when applying the one-step language Bellman update.
Figure 4: Frozen-Lake example of the language value function and language policy improvement. The language value function addresses the five predefined concepts, while the language policy improvement conducts chain-of-thought reasoning to determine the final action.
Figure 5: The policy value at each state in Frozen-Lake.
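
The propagation pattern described in the Figure 3 caption can be sketched as a simple frontier expansion. This is an illustrative toy under stated assumptions, not the paper's experiment: the grid size and goal position are hypothetical, there are no walls, and a cell is assumed to receive the goal information as soon as any of its four neighbours already holds it (one "one-step update" per iteration).

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]


def propagate(rows: int, cols: int, goal: Cell, iterations: int) -> List[Set[Cell]]:
    """Return the set of informed cells after 0, 1, ..., `iterations` sweeps."""
    informed: Set[Cell] = {goal}
    history = [set(informed)]
    for _ in range(iterations):
        frontier: Set[Cell] = set()
        for r, c in informed:
            # Pass the information to each in-bounds 4-neighbour.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    frontier.add((nr, nc))
        informed |= frontier
        history.append(set(informed))
    return history


# Hypothetical 3x3 grid with the goal in the bottom-right corner.
history = propagate(rows=3, cols=3, goal=(2, 2), iterations=4)
for step, cells in enumerate(history):
    print(step, len(cells))
```

Under these assumptions a cell becomes informed after a number of sweeps equal to its Manhattan distance from the goal, so the far corner `(0, 0)` is reached only on the fourth sweep.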
