LLM-Empowered State Representation for Reinforcement Learning
Boyuan Wang, Yun Qu, Yuhang Jiang, Jianzhun Shao, Chang Liu, Wenming Yang, Xiangyang Ji
TL;DR
This work tackles the sample inefficiency and unstable value mappings that arise in RL when state representations carry little task-specific information. It introduces LESR, a framework in which a large language model generates task-related state representation functions $\mathcal{F}: \mathcal{S} \to \mathcal{S}^r$ and intrinsic reward functions $\mathcal{G}: \mathcal{S}^c \to \mathbb{R}$, where $\mathcal{S}^c$ is the space of source states concatenated with their generated representations, and refines them over iterations using Lipschitz-constant feedback. The authors show theoretically that reducing the Lipschitz constant of the reward, $Lip(r; \mathcal{S})$, lowers the upper bound on the Lipschitz constant of the value function, $Lip(V; \mathcal{S})$, which improves convergence. Empirically, LESR outperforms baselines by an average of 29% in accumulated reward on MuJoCo tasks and 30% in success rate on Gym-Robotics tasks, and ablations demonstrate the importance of each component. The generated representations also transfer to other RL algorithms (PPO, SAC), are semantically coherent, and are robust across seeds and hyperparameters. Overall, LESR offers a model-agnostic way to improve RL by pairing LLM-generated representations with Lipschitz-based feedback.
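To make the Lipschitz argument concrete, here is a minimal Python sketch on a toy problem. It is not the authors' implementation: the function names `represent`, `reward`, and `empirical_lipschitz`, the 2-D state, the goal at the origin, and the squared-distance feature are all assumptions chosen for illustration, as is the reading of $\mathcal{S}^c$ as the source state concatenated with $\mathcal{F}(s)$. The sketch estimates the reward's Lipschitz constant from sampled state pairs, once over raw states and once over augmented states.

```python
import numpy as np


def represent(s):
    """Hypothetical stand-in for F: append one task-related feature,
    the squared distance to an assumed goal at the origin."""
    sq_dist = np.sum(s ** 2, axis=-1, keepdims=True)
    return np.concatenate([s, sq_dist], axis=-1)


def reward(s):
    """Toy reward (assumption): negative squared distance to the goal."""
    return -np.sum(s ** 2, axis=-1)


def empirical_lipschitz(inputs, outputs):
    """Largest |y_i - y_j| / ||x_i - x_j|| over all sampled pairs.
    This is a lower bound on the true Lipschitz constant."""
    diffs = inputs[:, None, :] - inputs[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    dout = np.abs(outputs[:, None] - outputs[None, :])
    mask = dist > 1e-8  # skip identical points (including the diagonal)
    return float(np.max(dout[mask] / dist[mask]))


rng = np.random.default_rng(0)
states = rng.uniform(-3.0, 3.0, size=(200, 2))  # sampled raw states
rewards = reward(states)

lip_raw = empirical_lipschitz(states, rewards)             # reward vs. raw state
lip_aug = empirical_lipschitz(represent(states), rewards)  # reward vs. augmented state
print(f"raw: {lip_raw:.3f}  augmented: {lip_aug:.3f}")
```

In this toy setting the augmented estimate cannot exceed 1, because the reward is a linear function of the appended feature, while the raw estimate grows with distance from the goal. The sketch only illustrates why a task-related feature can make the reward mapping smoother; it does not reproduce the specific Lipschitz feedback that LESR returns to the LLM.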
Abstract
Conventional state representations in reinforcement learning often omit critical task-related details, which makes it difficult for value networks to establish accurate mappings from states to task rewards. Traditional methods typically depend on extensive sample learning to enrich state representations with task-specific information, which leads to low sample efficiency and high time costs. Recently, knowledgeable large language models (LLMs) have emerged as promising substitutes for prior injection with minimal human intervention. Motivated by this, we propose LLM-Empowered State Representation (LESR), a novel approach that uses an LLM to autonomously generate task-related state representation code, which helps to enhance the continuity of network mappings and facilitates efficient training. Experimental results demonstrate that LESR exhibits high sample efficiency and outperforms state-of-the-art baselines by an average of 29% in accumulated reward on MuJoCo tasks and 30% in success rate on Gym-Robotics tasks.
