LLM-Empowered State Representation for Reinforcement Learning

Boyuan Wang; Yun Qu; Yuhang Jiang; Jianzhun Shao; Chang Liu; Wenming Yang; Xiangyang Ji

TL;DR

This work tackles RL sample-inefficiency and unstable value mappings caused by generic state representations. It introduces LESR, a framework that uses a Large Language Model to generate task-related state representations $\mathcal{F}: \mathcal{S} \to \mathcal{S}^r$ and intrinsic reward functions $\mathcal{G}: \mathcal{S}^c \to \mathbb{R}$, guided by Lipschitz-constant feedback across iterations. The authors establish theoretical connections showing that reducing the Lipschitz constant of the reward $Lip(r; \mathcal{S})$ lowers the upper bound on the value-function Lipschitz constant $Lip(V; \mathcal{S})$, improving convergence, and they validate these insights with empirical gains (approximately 29% in Mujoco and 30% in Gym-Robotics) and ablations demonstrating component importance. LESR also demonstrates transferability of the learned representations to other RL algorithms (PPO, SAC) and exhibits semantic coherence and robustness across seeds and hyperparameters, suggesting practical applicability to a range of tasks. Overall, LESR offers a promising, model-agnostic approach to enhance RL by leveraging LLM-driven representations paired with Lipschitz-based feedback.
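The Lipschitz-constant feedback described above can be made concrete with a small numerical sketch. The following is an illustration only, not the authors' implementation: it assumes the Lipschitz constant of a function over a finite set of sampled states can be approximated by the largest ratio $|f(s_i) - f(s_j)| / \|s_i - s_j\|$ over all pairs, and the helper name `empirical_lipschitz` and the toy reward signals are hypothetical.

```python
import itertools
import math


def empirical_lipschitz(states, values):
    """Largest observed |v_i - v_j| / ||s_i - s_j|| over all distinct pairs.

    Hypothetical helper: a sample-based lower estimate of the Lipschitz
    constant, not the estimator used in the paper.
    """
    best = 0.0
    for (s_i, v_i), (s_j, v_j) in itertools.combinations(zip(states, values), 2):
        dist = math.dist(s_i, s_j)
        if dist > 0.0:  # skip coincident states; the ratio is undefined there
            best = max(best, abs(v_i - v_j) / dist)
    return best


# Toy data (assumed for illustration): 1-D states on a grid with two
# different reward signals defined over the same states.
states = [(x / 10.0,) for x in range(11)]
smooth_rewards = [2.0 * s[0] for s in states]                    # constant slope
sparse_rewards = [1.0 if s[0] >= 0.95 else 0.0 for s in states]  # jump at the goal

smooth_lip = empirical_lipschitz(states, smooth_rewards)
sparse_lip = empirical_lipschitz(states, sparse_rewards)
print(smooth_lip, sparse_lip)
```

In this toy setting the reward with a sudden jump yields a much larger estimate than the smoothly varying one, which is the kind of signal that could be reported back to the LLM when comparing candidate representations.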

Abstract

Conventional state representations in reinforcement learning often omit critical task-related details, making it difficult for value networks to establish accurate mappings from states to task rewards. Traditional methods typically depend on extensive sample learning to enrich state representations with task-specific information, which leads to low sample efficiency and high time costs. Recently, increasingly knowledgeable large language models (LLMs) have emerged as a promising way to inject prior knowledge with minimal human intervention. Motivated by this, we propose LLM-Empowered State Representation (LESR), a novel approach that uses an LLM to autonomously generate task-related state representation codes, which help to enhance the continuity of network mappings and facilitate efficient training. Experimental results demonstrate that LESR exhibits high sample efficiency and outperforms state-of-the-art baselines by an average of 29% in accumulated reward in Mujoco tasks and 30% in success rates in Gym-Robotics tasks.

Paper Structure (31 sections, 11 theorems, 41 equations, 8 figures, 8 tables, 1 algorithm)


Introduction
Related Work
Method
Problem Statement
LLM-Empowered State Representation
Lipschitz Constant for Feedback
Theoretical Analysis
Experiments
Implementation Details
LESR Can Enhance Lipschitz Continuity
LESR Can Achieve High Sample Efficiency
Ablation Study
Directly Transfer to Other Algorithms
Semantic Analysis and Consistency Verification
Robustness
...and 16 more sections

Key Result

Theorem 3.4

Under the stated assumption, given $\mathcal{X}_0, \mathcal{Y}_0$ and $\mathcal{X}_1$, $\mathcal{Y}_1=\{y_i \mid y_i=u_1^*(x_i), x_i \in \mathcal{X}_1\}$, $u_0\in\mathcal{U}_0$ is any minimizer of $\mathcal{L}(u, \mathcal{X}_0)$ and $u_1\in\mathcal{U}_1$ is any minimizer of $\mathcal{L}(u, \mathcal{X}_1)$ ...

Figures (8)

Figure 1: Experiments conducted in PointMaze from Gym-Robotics (de2023gymnasium). The agent aims to navigate from the bottom-left (depicted by '$\bigcirc$') to the top-right target (depicted by '$\triangle$'). (a) Training curves of policies under two different state representations: (1) Source, which involves only the coordinates of the agent and the target; (2) LLM-Generated, which adds another dimension indicating the distance from the agent to the target. (b), (c) Visualizations of the final policies and learned state values. Arrows show the actions taken by the final policies. Heatmaps display the learned state values after just 500 training steps; the smoother heatmap for the LLM-generated representation indicates higher sample efficiency. The Lipschitz constant Lip($Q$) is defined in Definition \ref{defi:constant}. Further details are provided in Appendix \ref{appendix:toy_example}.
Figure 2: LESR Framework: (1) The LLM is prompted to generate code for state representation and intrinsic reward functions. Refer to Appendix \ref{appendix:prompts} for details on all prompt templates. (2) $K$ state representations and intrinsic rewards $\{\mathcal{F}_k\}_{k=1}^K, \{\mathcal{G}_k\}_{k=1}^K$ are sampled from the LLM. (3) During RL training, the functions $\mathcal{F}$ and $\mathcal{G}$ are used to generate $s^r = \mathcal{F}(s)$ for state representations and $r^i = \mathcal{G}(s, s^r)$ for intrinsic rewards. (4) Finally, the Lipschitz constants and episode returns of each candidate serve as feedback metrics for the LLM.
Figure 3: Visualization of states after reduction to two dimensions via t-SNE. Details of $T_1$ and $T_2$ are given in Section \ref{sec:demo_tsne}. The reward for each state is normalized to the range $[0, 1]$ and discretized, and colors in the graph encode the corresponding reward values.
Figure 4: Comparison on Mujoco tasks between LESR and a variant that removes the source state during training (i.e., uses only $\mathcal{F}(s)$ as input for policy and critic network training). The y-axis is on a logarithmic scale, and error bars represent 5 random seeds.
Figure 5: Experimental results for hyperparameter variations in AdroitHandDoor. The sample count $K$ is adjusted to [1, 3, 6, 9] (bottom x-axis), and the intrinsic reward weight $w$ is modified to [0.01, 0.02, 0.05, 0.1] (top x-axis).
...and 3 more figures
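The roles of $\mathcal{F}$ and $\mathcal{G}$ in the Figure 1 and Figure 2 captions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the paper: the state layout `(agent_x, agent_y, target_x, target_y)`, the negative-distance intrinsic reward, the concatenation of $s$ with $s^r$, the additive combination `task_reward + w * r_i`, and all function names are hypothetical choices made for the example.

```python
import math


def state_representation(s):
    """Hypothetical F: maps a source state to extra task-related features.

    Assumes s = (agent_x, agent_y, target_x, target_y) and returns the
    agent-to-target distance, mirroring the extra dimension in Figure 1.
    """
    agent_x, agent_y, target_x, target_y = s
    distance = math.hypot(target_x - agent_x, target_y - agent_y)
    return (distance,)


def intrinsic_reward(s, s_r):
    """Hypothetical G(s, s^r): larger when the agent is closer to the target."""
    (distance,) = s_r
    return -distance


def augment(s, task_reward, w=0.02):
    """Return the concatenated state and the combined reward.

    Both the concatenation and the form task_reward + w * r_i are
    assumptions made for this sketch; w is an illustrative weight.
    """
    s_r = state_representation(s)
    r_i = intrinsic_reward(s, s_r)
    return tuple(s) + s_r, task_reward + w * r_i


s = (0.0, 0.0, 3.0, 4.0)
s_c, reward = augment(s, task_reward=0.0)
print(s_c, reward)
```

A wrapper of this shape would sit between the environment and the RL algorithm, so the policy and critic see the enriched state while the underlying task is unchanged.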

Theorems & Definitions (25)

Definition 3.1
Definition 3.2
Theorem 3.4
Definition 1.1
Theorem 1.2: Convergence
Proof: Proof of Theorem \ref{theorem:convergence1}
Lemma 1.3: Lemma 2.9 in oberman2018lipschitz
Theorem 1.4
Proof: Proof of Theorem \ref{theorem:convergence2}
Lemma 1.5
...and 15 more
