Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes

Aviral Kumar; Rishabh Agarwal; Xinyang Geng; George Tucker; Sergey Levine

TL;DR

The paper addresses the challenge of scaling offline Q-learning to large, diverse data by proposing Scaled Q-learning, which combines ResNet-based encoders, distributional (C51) backups, and feature normalization in a multi-task Atari setting. It shows that capacity scaling yields favorable performance trends, with Scaled Q-learning surpassing supervised baselines on suboptimal data and matching or exceeding performance on near-optimal data using substantially fewer parameters than some competitors. The work also demonstrates that offline multi-task training learns representations that enable strong transfer to unseen games and rapid online fine-tuning on novel game variants, highlighting the potential of offline RL to generalize beyond the training dataset. Overall, the results suggest offline Q-learning can scale with model capacity to produce broadly generalizable policies and transferable representations, motivating further exploration of large-scale offline RL in diverse domains.

Abstract

The potential of offline reinforcement learning (RL) is that high-capacity models trained on large, heterogeneous datasets can lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the learnings from these works, we re-examine previous design choices and find that with appropriate choices (ResNets, cross-entropy based distributional backups, and feature normalization), offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using networks with up to 80 million parameters, finding that model performance scales favorably with capacity. In contrast to prior work, we extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-level performance). Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that offline Q-learning with a diverse dataset is sufficient to learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing state-of-the-art representation learning approaches.
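Two of the design choices named in the abstract, feature normalization and cross-entropy based distributional backups, can be illustrated with a minimal sketch. It assumes that feature normalization means rescaling a feature vector to unit L2 norm and that the distributional target is already a probability vector over a fixed set of return atoms; the projection of the Bellman target onto those atoms is omitted. The function names and the `eps` parameter are hypothetical, and this is not the authors' implementation.

```python
import math


def l2_normalize(features, eps=1e-8):
    """Rescale a feature vector to (approximately) unit L2 norm.

    Assumption: this stands in for the paper's "feature normalization".
    eps guards against division by zero for an all-zero vector.
    """
    norm = math.sqrt(sum(x * x for x in features))
    return [x / (norm + eps) for x in features]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a categorical target and predicted logits.

    target_probs: target distribution over return atoms (sums to 1).
    logits: unnormalized predictions over the same atoms.
    """
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)
```

For instance, `l2_normalize([3.0, 4.0])` gives a vector close to `[0.6, 0.8]`, and the loss is smallest when the predicted distribution matches the target.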

Paper Structure (21 sections, 2 equations, 11 figures, 8 tables)

Introduction
Related Work
Preliminaries and Problem Setup
Our Approach for Scaling Offline RL
Experimental Evaluation
Does Offline Q-Learning Scale Favorably?
Can Offline RL Learn Useful Initializations that Enable Fine-Tuning?
Ablation Studies
Discussion
Additional Results
Additional Results from the Paper
Results for Scaling Discrete-BCQ
Ablation for Backbone Architecture
Results for Scaled QL Without Pessimism
Implementation Details and Hyper-parameters
...and 6 more sections

Figures (11)

Figure 1: An overview of the training and evaluation setup. Models are trained offline with potentially sub-optimal data. We adapt CQL to the multi-task setup via a multi-headed architecture. The pre-trained visual encoder is reused in fine-tuning (the weights are either frozen or fine-tuned), whereas the downstream fully-connected layers are reinitialized and trained.
Figure 2: Offline multi-task performance on 40 games with sub-optimal data. Left. Scaled QL significantly outperforms the previous state-of-the-art method, DT, attaining about a 2.5x performance improvement in normalized IQM score. To contextualize the absolute numbers, we include online multi-task Impala DQN (Espeholt et al., 2018) trained on 5x as much data. Right. Performance profiles (Agarwal et al., 2021) showing the distribution of normalized scores across all 40 training games (higher is better). Scaled QL stochastically dominates other offline RL algorithms and achieves superhuman performance in 40% of the games. "Behavior policy" corresponds to the score of the dataset trajectories. Online MT DQN (5X), taken directly from Lee et al. (2022), corresponds to running multi-task online RL for 5x more data with IMPALA (details in the appendix).
Figure 3: An overview of the network architecture. The key design decisions are: (1) the use of ResNet models with learned spatial embeddings and group normalization, (2) use of a distributional representation of return values and cross-entropy TD loss for training (i.e., C51; Bellemare et al., 2017), and (3) feature normalization to stabilize training.
Figure 4: Comparing Scaled QL to DT on all training games on the sub-optimal dataset.
Figure 5: Offline scaled conservative Q-learning vs other prior methods with near-optimal data and sub-optimal data. Scaled QL outperforms the best DT model, attaining an IQM human-normalized score of 114.1% on the near-optimal data and 77.8% on the sub-optimal data, compared to 111.8% and 30.6% for DT, respectively.
...and 6 more figures
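
The captions above report IQM human-normalized scores. As a rough sketch of how such a number can be computed, the snippet below assumes the usual Atari convention that a human-normalized score is (score - random) / (human - random) and that the interquartile mean (IQM) is the mean of the values left after discarding the lowest and highest 25%. The function names and all inputs are hypothetical; the paper's exact aggregation over games and runs may differ.

```python
def human_normalized(score, random_score, human_score):
    """Human-normalized score: 0.0 matches random play, 1.0 matches the human reference."""
    return (score - random_score) / (human_score - random_score)


def iqm(values):
    """Interquartile mean: average of the middle 50% of the sorted values.

    Drops int(0.25 * n) entries from each end, the same truncation rule
    used by a 25% trimmed mean.
    """
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)
```

A usage example with made-up numbers: `iqm([human_normalized(s, r, h) for s, r, h in per_game_scores])`, where `per_game_scores` is a hypothetical list of (agent, random, human) score triples, one per game.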
