Natural Language Reinforcement Learning
Xidong Feng, Ziyu Wan, Mengyue Yang, Ziyan Wang, Girish A. Koushik, Yali Du, Ying Wen, Jun Wang
TL;DR
This work introduces Natural Language Reinforcement Learning (NLRL), a framework that maps traditional RL concepts onto natural language representations and uses large language models (LLMs) to perform policy evaluation, policy improvement, and value estimation in language space. It defines a text-based MDP with a language task instruction $T_L$, language descriptors $D_L$, and language value functions $V^L_\pi$ and $Q^L_\pi$, along with a language Bellman equation, enabling language-driven generalized policy iteration. LLMs serve as the policy, value function approximator, information aggregator, and policy-improvement operator, with chain-of-thought reasoning and concept-based abstractions used to enhance interpretability. Demonstrations on tabular MDPs (text GridWorld and Frozen Lake) show that NLRL can achieve interpretable, information-rich reasoning and competitive performance, though hallucinated outputs and scalability remain key challenges. Overall, the approach highlights a promising direction for interpretable, language-based RL and suggests that language can provide dense supervisory signals and rich prior knowledge for learning.
Abstract
Reinforcement Learning (RL) has shown remarkable abilities in learning policies for decision-making tasks. However, RL is often hindered by issues such as low sample efficiency, lack of interpretability, and sparse supervision signals. To tackle these limitations, we take inspiration from the human learning process and introduce Natural Language Reinforcement Learning (NLRL), which innovatively combines RL principles with natural language representation. Specifically, NLRL redefines RL concepts like task objectives, policy, value function, Bellman equation, and policy iteration in natural language space. We present how NLRL can be practically implemented with the latest advancements in large language models (LLMs) like GPT-4. Initial experiments over tabular MDPs demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework.
