Dataset Distillation for Offline Reinforcement Learning

Jonathan Light, Yuanzhe Liu, Ziniu Hu

TL;DR

This paper addresses offline reinforcement learning by using dataset distillation via gradient matching to synthesize a compact, high-quality training dataset from a given offline collection. The method aligns the behavioral cloning (BC) gradients computed on real and synthetic data, so that a policy trained on the small synthetic set performs comparably to, or better than, one trained on the full offline dataset or on percentile-filtered baselines, as demonstrated on Procgen tasks. The key contribution is the synthetic-data approach, which improves data efficiency and generalization in offline RL; the experiments also reveal practical limitations, such as action-imbalance effects in certain environments. The work points to a data-centric direction for RL in which smaller, curated datasets support robust policy learning with lower computational and data-collection cost.
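To make the gradient-matching idea above concrete, here is a minimal sketch, not the authors' implementation. It assumes a linear softmax policy, a cross-entropy BC loss, and a cosine-based distance between gradients; the summary does not specify any of these, and the names `bc_gradient` and `matching_loss` are hypothetical. The sketch only evaluates the matching loss; a distillation loop would additionally update the synthetic states to reduce it.

```python
import numpy as np

def bc_gradient(weights, states, actions):
    """Gradient of the mean cross-entropy (BC) loss for a linear softmax
    policy with logits = states @ weights. The policy class is an assumption."""
    logits = states @ weights
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    one_hot = np.eye(weights.shape[1])[actions]
    return states.T @ (probs - one_hot) / len(states)

def matching_loss(grad_real, grad_syn, eps=1e-8):
    """One minus cosine similarity of the flattened gradients.
    The choice of distance is an assumption, not taken from the paper."""
    a, b = grad_real.ravel(), grad_syn.ravel()
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

rng = np.random.default_rng(0)
n_features, n_actions = 8, 4
weights = rng.normal(size=(n_features, n_actions))

# Stand-in for the real offline dataset (random data, for illustration only).
real_states = rng.normal(size=(256, n_features))
real_actions = rng.integers(0, n_actions, size=256)

# A much smaller synthetic dataset, randomly initialized.
syn_states = rng.normal(size=(16, n_features))
syn_actions = rng.integers(0, n_actions, size=16)

g_real = bc_gradient(weights, real_states, real_actions)
g_syn = bc_gradient(weights, syn_states, syn_actions)

loss_random = matching_loss(g_real, g_syn)  # unoptimized synthetic data
loss_same = matching_loss(g_real, g_real)   # perfectly matched gradients
```

A loss near zero means that a BC update computed on the synthetic set points in the same direction as one computed on the real data, which is the property the distillation step tries to achieve.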

Abstract

Offline reinforcement learning often requires a quality dataset that we can train a policy on. However, in many situations, it is not possible to get such a dataset, nor is it easy to train a policy to perform well in the actual environment given the offline data. We propose using data distillation to train and distill a better dataset which can then be used for training a better policy model. We show that our method is able to synthesize a dataset where a model trained on it achieves similar performance to a model trained on the full dataset or a model trained using percentile behavioral cloning. Our project site is available at https://datasetdistillation4rl.github.io. We also provide our implementation at https://github.com/ggflow123/DDRL.

Paper Structure (19 sections, 5 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Overview of our dataset distillation process. On the left, we train the synthetic dataset by minimizing a gradient matching loss between the real offline dataset and the synthetic dataset. On the right, we use the trained synthetic dataset to train an RL model, which we then evaluate in the real environment.
  • Figure 2: Screenshots of games in the Procgen Benchmark (Cobbe et al., 2020)
  • Figure 3: In-distribution performance of various data collection methods
  • Figure 4: Out-of-distribution performance of various data collection methods