Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts

Jing-Cheng Pang; Si-Hang Yang; Kaiyuan Li; Jiaji Zhang; Xiong-Hui Chen; Nan Tang; Yang Yu

TL;DR

This paper tackles the limited generalization of offline RL by leveraging knowledge encoded in large language models. It introduces KALM, which grounds an LLM in the environment so that it can generate imaginary rollouts for novel skills; these rollouts then augment offline RL training. On CLEVR-Robot, KALM raises the success rate on unseen goals to 46%, compared with 26% for baseline methods, and the quality of the generated rollouts and the accuracy of the LLM's explanations indicate that the grounding is meaningful. The approach offers a practical pathway for integrating language-derived knowledge with reinforcement learning to broaden agent capabilities. The paper acknowledges limitations in modality scope and domain generalization.

Abstract

Reinforcement learning (RL) trains agents to accomplish complex tasks through environmental interaction data, but its capacity is limited by the scope of the available data. To obtain a knowledgeable agent, a promising approach is to leverage the knowledge from large language models (LLMs). Despite previous studies combining LLMs with RL, seamless integration of the two components remains challenging due to their semantic gap. This paper introduces a novel method, Knowledgeable Agents from Language Model Rollouts (KALM), which extracts knowledge from LLMs in the form of imaginary rollouts that can be easily learned by the agent through offline reinforcement learning methods. The primary challenge of KALM lies in LLM grounding, as LLMs are inherently limited to textual data, whereas environmental data often comprise numerical vectors unseen by LLMs. To address this, KALM fine-tunes the LLM to perform various tasks based on environmental data, including bidirectional translation between natural language descriptions of skills and their corresponding rollout data. This grounding process enhances the LLM's comprehension of environmental dynamics, enabling it to generate diverse and meaningful imaginary rollouts that reflect novel skills. Initial empirical evaluations on the CLEVR-Robot environment demonstrate that KALM enables agents to complete complex rephrasings of task goals and extend their capabilities to novel tasks requiring unprecedented optimal behaviors. KALM achieves a success rate of 46% in executing tasks with unseen goals, substantially surpassing the 26% success rate achieved by baseline methods. Furthermore, KALM effectively enables the LLM to comprehend environmental dynamics, resulting in the generation of meaningful imaginary rollouts that reflect novel skills and demonstrate the seamless integration of large language models and reinforcement learning.
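The data flow described in the abstract (generate imaginary rollouts for novel goals, then train on them together with the offline data) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Rollout` container, the function names, the stand-in `stub_llm`, and the example novel goal are all hypothetical, and the fine-tuned LLM is replaced by a stub that simply repeats the initial state. The sketch only shows how real and imaginary rollouts might be merged into one training set before an offline RL algorithm consumes them.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Rollout:
    """Hypothetical container for one trajectory and its language goal."""
    goal: str                  # natural language description of the skill
    states: List[List[float]]  # numeric state vectors, one per time step
    actions: List[int]         # one action per transition
    imaginary: bool = False    # True if generated by the LLM, not the environment


# Assumed interface of a grounded LLM: (goal, initial state) -> (states, actions)
Generator = Callable[[str, Sequence[float]], Tuple[List[List[float]], List[int]]]


def generate_imaginary_rollouts(
    llm_generate: Generator, goals: Sequence[str], initial_state: Sequence[float]
) -> List[Rollout]:
    """Ask the (assumed) grounded LLM for one rollout per novel goal."""
    rollouts = []
    for goal in goals:
        states, actions = llm_generate(goal, initial_state)
        rollouts.append(Rollout(goal, states, actions, imaginary=True))
    return rollouts


def build_training_set(
    offline: Sequence[Rollout], imaginary: Sequence[Rollout], seed: int = 0
) -> List[Rollout]:
    """Merge real and imaginary rollouts and shuffle them deterministically."""
    combined = list(offline) + list(imaginary)
    random.Random(seed).shuffle(combined)
    return combined


def stub_llm(goal: str, initial_state: Sequence[float]):
    """Stand-in for the fine-tuned LLM: repeats the initial state for 3 steps."""
    horizon = 3
    states = [list(initial_state) for _ in range(horizon + 1)]
    actions = [0] * horizon
    return states, actions


offline = [
    Rollout(
        "Can you move the purple ball to the left of the blue ball?",
        [[0.0, 0.0], [0.1, 0.0]],
        [1],
    )
]
# Hypothetical novel goal, not taken from the paper's dataset.
imagined = generate_imaginary_rollouts(stub_llm, ["gather all balls together"], [0.0, 0.0])
dataset = build_training_set(offline, imagined)
print(len(dataset), sum(r.imaginary for r in dataset))  # prints "2 1"
```

Keeping an `imaginary` flag on each rollout is a design choice made for this sketch; it would let a downstream learner weight or filter generated data, which the abstract does not specify.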

Paper Structure (34 sections, 10 figures, 1 table)

Introduction
Related Work
Offline Reinforcement Learning
Large Language Models
Integrating LLMs and RL
Preliminary
Reinforcement learning.
Large language model.
Method
Problem Formulation
LLM Grounding with Instruction-following Fine-tuning
Rollout Generation with Novel Skill Prompt
Skill Acquisition via Offline Reinforcement Learning
Experiment
Experimental Setting
...and 19 more sections

Figures (10)

Figure 1: Illustration of KALM utilizing an LLM to generate environmental rollouts. (1) Grounding phase, which fine-tunes the LLM with supervised fitting on the environmental data. (2) Generation phase, which prompts the LLM to generate data for novel skills. KALM modifies the input/output layers of the LLM, enabling it to process and interpret non-textual data.
Figure 2: Overall procedure of KALM, consisting of three key modules: (A) an LLM grounding module that grounds the LLM in the environment and aligns it with inputs of environmental data, (B) a rollout generation module that prompts the LLM to generate data for novel skills, and (C) a skill acquisition module that trains the policy with offline RL. Finally, KALM derives a policy that is trained on both offline data and imaginary data.
Figure 3: KALM utilizes a pre-trained LLM as the backbone model, but adapts the architecture of the LLM to process non-textual data. It accepts a variety of data types, including states, actions, and text tokens. KALM employs distinct embedding layers to convert these inputs into embeddings of the same dimension. For outputs, the LLM uses different output heads to produce the predicted state, action, and text.
Figure 4: A visualization of the CLEVR-Robot environment in our experiments. The agent (silver point) manipulates five movable balls to reach a specific configuration. An example of a natural language goal in the offline dataset is: "Can you move the purple ball to the left of the blue ball?"
Figure 5: Training curves of different methods on four types of tasks. The x-axis represents the number of training epochs, and the y-axis represents the success rate of completing the natural language goals for different types of tasks. The shaded area stands for the standard deviation over three random seeds.
...and 5 more figures
