Offline Actor-Critic Reinforcement Learning Scales to Large Models
Jost Tobias Springenberg, Abbas Abdolmaleki, Jingwei Zhang, Oliver Groth, Michael Bloesch, Thomas Lampe, Philemon Brakel, Sarah Bechtle, Steven Kapturowski, Roland Hafner, Nicolas Heess, Martin Riedmiller
TL;DR
This work tackles the challenge of training large transformer-based policies for control using offline reinforcement learning (RL). It introduces the Perceiver-Actor-Critic (PAC) framework, a KL-regularized offline actor-critic with a Perceiver-IO backbone that supports multimodal inputs and cross-attention-based Q-value decoding, enabling scalable learning up to around $1\text{B}$ parameters on a $2.45\text{T}$ token data mix. The key contributions are a principled offline RL objective that blends policy improvement with data-distribution regularization, and a scalable architecture that supports joint learning from proprioception, vision, and language. Empirically, PAC outperforms strong behavioral cloning (BC) baselines across 132 tasks, follows scaling laws similar to those of supervised learning, and enables RL fine-tuning with self-generated data on real robots, achieving high mastery without online exploration. The results suggest that offline actor-critic methods can scale with model size and dataset breadth to produce versatile, multi-task control policies from sub-optimal demonstrations, with the potential to integrate with pre-trained multimodal models in the future.
Abstract
We show that offline actor-critic reinforcement learning can scale to large models, such as transformers, and follows scaling laws similar to those of supervised learning. We find that offline actor-critic algorithms can outperform strong supervised behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor-critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.
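
To make the idea of "gradually moving away from behavioral cloning" concrete, the following is a minimal sketch of a policy loss that interpolates between a BC term and a KL-regularized policy-improvement term. It is an illustration under stated assumptions, not the objective as defined in the paper: actions are assumed to be discretized into bins, Q-values are assumed to be given, and the names `blended_policy_loss`, `alpha`, and `eta` are hypothetical.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def blended_policy_loss(logits, q_values, data_actions, alpha=0.5, eta=1.0):
    """Illustrative loss: (1 - alpha) * BC term + alpha * improvement term.

    logits:       [batch, num_actions] policy logits (assumed discrete action bins)
    q_values:     [batch, num_actions] critic estimates (assumed given)
    data_actions: [batch] indices of the actions stored in the offline dataset
    alpha:        hypothetical mixing weight; alpha = 0 recovers pure BC
    eta:          hypothetical temperature of the KL-regularized improvement step
    """
    logits = np.asarray(logits, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    data_actions = np.asarray(data_actions)

    log_pi = log_softmax(logits)

    # BC term: negative log-likelihood of the dataset actions.
    bc = -log_pi[np.arange(len(data_actions)), data_actions]

    # Improvement term: cross-entropy to a target proportional to
    # pi(a|s) * exp(Q(s, a) / eta), with the target treated as a constant.
    target = np.exp(log_softmax(log_pi + q_values / eta))
    improve = -(target * log_pi).sum(axis=-1)

    return float(((1.0 - alpha) * bc + alpha * improve).mean())


logits = np.array([[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]])
q_values = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
data_actions = np.array([0, 1])
print(blended_policy_loss(logits, q_values, data_actions, alpha=0.5))
```

Setting `alpha = 0` reduces the sketch to plain behavioral cloning, while larger values put more weight on actions the critic rates highly. The regularization toward the data distribution is what keeps the policy from exploiting Q-value errors on actions the dataset never covers.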
