Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning

Ruizhe Shi; Yuyao Liu; Yanjie Ze; Simon S. Du; Huazhe Xu

Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning

Ruizhe Shi, Yuyao Liu, Yanjie Ze, Simon S. Du, Huazhe Xu

TL;DR

Empirical results indicate $\textbf{LaMo}$ achieves state-of-the-art performance in sparse-reward tasks and closes the gap between value-based offline RL methods and decision transformers in dense-reWARD tasks.

Abstract

Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets. In real-world scenarios, data collection could be costly and risky; therefore, offline RL becomes particularly challenging when the in-domain data is limited. Given recent advances in Large Language Models (LLMs) and their few-shot learning prowess, this paper introduces $\textbf{La}$nguage Models for $\textbf{Mo}$tion Control ($\textbf{LaMo}$), a general framework based on Decision Transformers to effectively use pre-trained Language Models (LMs) for offline RL. Our framework highlights four crucial components: (1) Initializing Decision Transformers with sequentially pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to full-weight fine-tuning, to combine the pre-trained knowledge from LMs and in-domain knowledge effectively, (3) using the non-linear MLP transformation instead of linear projections, to generate embeddings, and (4) integrating an auxiliary language prediction loss during fine-tuning to stabilize the LMs and retain their original abilities on languages. Empirical results indicate $\textbf{LaMo}$ achieves excellent performance in sparse-reward tasks and closes the gap between value-based offline RL methods and decision transformers in dense-reward tasks. In particular, our method demonstrates superior performance in scenarios with limited data samples.

Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning

TL;DR

Empirical results indicate

achieves state-of-the-art performance in sparse-reward tasks and closes the gap between value-based offline RL methods and decision transformers in dense-reWARD tasks.

Abstract

nguage Models for

tion Control (

), a general framework based on Decision Transformers to effectively use pre-trained Language Models (LMs) for offline RL. Our framework highlights four crucial components: (1) Initializing Decision Transformers with sequentially pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to full-weight fine-tuning, to combine the pre-trained knowledge from LMs and in-domain knowledge effectively, (3) using the non-linear MLP transformation instead of linear projections, to generate embeddings, and (4) integrating an auxiliary language prediction loss during fine-tuning to stabilize the LMs and retain their original abilities on languages. Empirical results indicate

achieves excellent performance in sparse-reward tasks and closes the gap between value-based offline RL methods and decision transformers in dense-reward tasks. In particular, our method demonstrates superior performance in scenarios with limited data samples.

Paper Structure (25 sections, 7 equations, 12 figures, 16 tables)

This paper contains 25 sections, 7 equations, 12 figures, 16 tables.

Introduction
Related Work
Preliminaries
Offline Reinforcement Learning
Decision Transformer
Method
Pre-training on Language Tasks
Finetuning for Offline Reinforcement Learning
Experiments
Experiment Setup
Sparse-reward tasks
Dense-reward tasks
Ability in Low-Data Regime
Ablations
Conclusion
...and 10 more sections

Figures (12)

Figure 1: Normalized score on D4RL d4rl dataset of Language Models for Motion Control (LaMo), Decision Transformer (DT, DT), Wiki-RL wiki, Conservative Q-Learning (CQL, CQL) and Behavior Cloning (BC). We average scores over tasks and data sample ratios for each domain. (Medium for Mujoco and Atari, Complete and Partial for Kitchen, of different sample ratios, described in Appendix \ref{['section:Dataset Descriptions']}.)
Figure 2: The overview of LaMo. LaMo mainly consists of two stages: (1) pre-training LMs on language tasks, (2) freezing the pre-trained attention layers, replacing linear projections with MLPs, and using LoRA to adapt to RL tasks. We also apply the language loss during the offline RL stage as a regularizer.
Figure 3: Normalized score obtained by LaMo, CQL, and DT on various data sample ratios. Mean of $3$ seeds with number $0,1,2$. Shaded area is $[\mu-0.5\sigma,\mu+0.5\sigma]$ interval, where $\mu$ is the average and $\sigma$ is the standard deviation.
Figure 4: Ablation on the effectiveness of MLP embeddings. We replace the MLPs in LaMo as embeddings with linear projections, denoted as LaMo w/o. MLP. We compare LaMo with LaMo w/o. MLP and DT across all tasks. Mean of $3$ seeds with number $0,1,2$. Shaded area is $[\mu-0.5\sigma,\mu+0.5\sigma]$ interval, where $\mu$ is the average and $\sigma$ is the standard deviation.
Figure 5: Ablation on the effectiveness of LoRA. (1) We involve all the parameters into fine-tuning, denoted as Full Finetuning. (2) We freeze all parameters in Transformer layers and leave out LoRA, denoted as Freezing. We compare LaMo with Full Finetuning, Freezing, and DT.
...and 7 more figures

Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning

TL;DR

Abstract

Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning

Authors

TL;DR

Abstract

Table of Contents

Figures (12)