Pre-training with Synthetic Data Helps Offline Reinforcement Learning

Zecheng Wang; Che Wang; Zixuan Dong; Keith Ross

TL;DR

The paper asks whether the gains from pre-training in offline reinforcement learning actually depend on language data. It pre-trains Decision Transformer (DT) on synthetic data generated by a one-step Markov chain with $M=100$ states, using autoregressive next-state prediction; this yields substantial improvements over both the baseline DT and the Wikipedia-pre-trained DT while using only a fraction of the pre-training updates. The authors then extend the approach to Conservative Q-Learning (CQL) by pre-training its MLP on synthetic MDP data with a forward dynamics objective, which gives consistent improvements across D4RL locomotion tasks. IID synthetic data performs nearly as well as MDP data, and the accompanying analysis shows that the forward dynamics loss reduces to a centroid computation, which helps explain this robustness. Overall, the work shows that inexpensive synthetic pre-training can meaningfully boost offline DRL performance across architectures, suggesting that simple synthetic pre-training schemes are a practical alternative to language pre-training.
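To make the data-generation idea concrete, the following is a minimal sketch of sampling token sequences from a one-step Markov chain with 100 states and pairing each token with its successor for next-state prediction. It is an illustration under stated assumptions, not the authors' implementation: the softmax-over-random-logits construction, the `temperature` parameter, the sequence length, and the function names `make_transition_matrix` and `sample_chain` are all hypothetical choices made here.

```python
import numpy as np

def make_transition_matrix(num_states=100, temperature=1.0, seed=0):
    """Build a random row-stochastic matrix (assumed construction, not from the paper)."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=(num_states, num_states))
    scaled = logits / temperature
    scaled -= scaled.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum(axis=1, keepdims=True)

def sample_chain(transition, length, seed=0):
    """Sample a state sequence where each state depends only on the previous one."""
    rng = np.random.default_rng(seed)
    num_states = transition.shape[0]
    states = np.empty(length, dtype=np.int64)
    states[0] = rng.integers(num_states)  # uniform initial state (assumption)
    for t in range(1, length):
        states[t] = rng.choice(num_states, p=transition[states[t - 1]])
    return states

P = make_transition_matrix()           # 100 x 100 transition probabilities
tokens = sample_chain(P, length=1024)  # one synthetic "sentence" of state tokens
# Autoregressive next-state prediction: the target at step t is the token at t + 1.
inputs, targets = tokens[:-1], tokens[1:]
```

Replacing `sample_chain` with independent uniform draws over the same state set would give an IID variant of the data in this sketch.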

Abstract

Recently, it has been shown that for offline deep reinforcement learning (DRL), pre-training Decision Transformer with a large language corpus can improve downstream performance (Reid et al., 2022). A natural question to ask is whether this performance gain can only be achieved with language pre-training, or can be achieved with simpler pre-training schemes which do not involve language. In this paper, we first show that language is not essential for improved performance, and indeed pre-training with synthetic IID data for a small number of updates can match the performance gains from pre-training with a large language corpus; moreover, pre-training with data generated by a one-step Markov chain can further improve the performance. Inspired by these experimental results, we then consider pre-training Conservative Q-Learning (CQL), a popular offline DRL algorithm, which is Q-learning-based and typically employs a Multi-Layer Perceptron (MLP) backbone. Surprisingly, pre-training with simple synthetic data for a small number of updates can also improve CQL, providing consistent performance improvement on D4RL Gym locomotion datasets. The results of this paper not only illustrate the importance of pre-training for offline DRL but also show that the pre-training data can be synthetic and generated with remarkably simple mechanisms.
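The TL;DR notes that the forward dynamics loss used for CQL pre-training reduces to a centroid computation. The sketch below checks the standard least-squares fact behind such a statement: when the same input is paired with several next-state targets, the prediction that minimizes mean squared error is the mean (centroid) of those targets. It assumes a tabular predictor with one free output per distinct input, and the sizes, the uniform sampling of targets, and all variable names are hypothetical; it does not reproduce the paper's exact derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_inputs, samples_per_input, state_dim = 5, 200, 3

# Hypothetical data: each distinct (state, action) input, identified by an id,
# appears many times, each time with a different next-state target.
input_ids = np.repeat(np.arange(num_inputs), samples_per_input)
next_states = rng.uniform(-1.0, 1.0, size=(input_ids.size, state_dim))

def mse(predictions_per_input):
    """Mean squared forward-dynamics error of a tabular predictor."""
    residuals = predictions_per_input[input_ids] - next_states
    return np.mean(np.sum(residuals ** 2, axis=1))

# Centroid of the targets observed for each input.
centroids = np.stack(
    [next_states[input_ids == i].mean(axis=0) for i in range(num_inputs)]
)

best_loss = mse(centroids)
# Any other prediction, here a small random perturbation, gives a larger loss.
perturbed_loss = mse(centroids + 0.1 * rng.normal(size=centroids.shape))

# At the centroid, the residuals for each input sum to zero (zero gradient).
residual_sums = np.stack(
    [(centroids[i] - next_states[input_ids == i]).sum(axis=0) for i in range(num_inputs)]
)
```

In this toy setting the minimizer depends only on per-input target averages, which is the sense in which the loss "reduces to a centroid computation" here.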

Paper Structure (36 sections, 6 equations, 17 figures, 29 tables)

Introduction
Related Work
Pre-training Decision Transformer with Synthetic Data
Overview of Decision Transformer
Generating Synthetic Markov Chain Data
Results for pre-training DT with synthetic data
Ablations for Pre-training DT with Synthetic Data
Pre-training CQL with Synthetic Data
Generating Synthetic MDP Data
Results for CQL with Synthetic Data Pre-training
Analysis of optimization objective
Conclusion
Hyperparameters & Training Details
Decision Transformer
Implementation & Experiment details
...and 21 more sections

Figures (17)

Figure 1: Performance and loss curves, averaged over 12 datasets, for DT, DT+Wiki, and DT+Synthetic.
Figure 2: Performance and loss curves, averaged over 12 datasets for CQL, CQL+MDP and CQL+IID.
Figure 3: Learning curves for DT, DT with Wikipedia pre-training, and DT with synthetic pre-training. The curve for our pre-training scheme (DT+Synthetic) is offset by 20,000 updates to account for its 20,000 pre-training updates with synthetic data.
Figure 4: Training loss curves for DT, DT with Wikipedia pre-training, and DT with synthetic pre-training. The curve for our pre-training scheme (DT+Synthetic) is offset by 20,000 updates to account for its 20,000 pre-training updates with synthetic data.
Figure 5: Fine-tuning performance curves for CQL baseline, CQL with synthetic MDP pre-training, and CQL with synthetic IID pre-training, on each individual dataset.
...and 12 more figures
