Offline Actor-Critic Reinforcement Learning Scales to Large Models

Jost Tobias Springenberg; Abbas Abdolmaleki; Jingwei Zhang; Oliver Groth; Michael Bloesch; Thomas Lampe; Philemon Brakel; Sarah Bechtle; Steven Kapturowski; Roland Hafner; Nicolas Heess; Martin Riedmiller

TL;DR

This work tackles the challenge of training large transformer-based policies for control using offline reinforcement learning. It introduces the Perceiver-Actor-Critic (PAC) framework, a KL-regularized offline actor-critic with a Perceiver-IO backbone that supports multimodal inputs and cross-attention-based Q-value decoding, enabling scalable learning up to around $1\text{B}$ parameters on a $2.45\text{T}$ token data mix. The key contributions include a principled offline RL objective that blends policy improvement with data-distribution regularization and a scalable architecture that supports joint learning from proprioception, vision, and language. Empirically, PAC outperforms strong BC baselines across 132 tasks, demonstrates favorable scaling laws comparable to supervised learning, and enables RL-fine-tuning with self-generated data on real robots, achieving high mastery without online exploration. This work suggests offline actor-critic methods can scale with model size and dataset breadth to produce versatile, multi-task control policies from sub-optimal demonstrations, with potential to integrate with pre-trained multimodal models in the future.

Abstract

We show that offline actor-critic reinforcement learning can scale to large models, such as transformers, and follows scaling laws similar to those of supervised learning. We find that offline actor-critic algorithms can outperform strong, supervised, behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor-critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.
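
To make the scaling-law claim concrete, the following is a minimal sketch of how a power-law trend can be fitted in log-log space. It is not the authors' analysis code. The FLOP range and the count of 100 logarithmically spaced points follow the Figure 3 caption; the helper `fit_power_law`, the synthetic parameter counts, and the values `true_exponent` and `true_coeff` are assumptions invented for illustration only.

```python
import numpy as np


def fit_power_law(compute, quantity):
    """Fit quantity ~= coeff * compute**exponent by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log10(compute), np.log10(quantity), deg=1)
    return slope, 10.0 ** intercept


# 100 logarithmically spaced FLOP budgets between 5e18 and 5e20 (range from Figure 3).
flops = np.logspace(np.log10(5e18), np.log10(5e20), num=100)

# Synthetic stand-in for compute-optimal parameter counts.
# The exponent and coefficient below are made up, not results from the paper.
true_exponent, true_coeff = 0.5, 0.03
rng = np.random.default_rng(0)
noise = np.exp(rng.normal(0.0, 0.05, size=flops.shape))
params = true_coeff * flops ** true_exponent * noise

# Recover the trend and extrapolate it to a larger budget of 1e21 FLOPs.
exponent, coeff = fit_power_law(flops, params)
extrapolated = coeff * 1e21 ** exponent
print(f"fitted exponent: {exponent:.3f}, extrapolated params at 1e21 FLOPs: {extrapolated:.3e}")
```

A linear fit on logarithms is the usual choice here because a power law becomes a straight line in log-log coordinates, so the slope is the scaling exponent.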

Paper Structure (51 sections, 19 equations, 9 figures, 19 tables)

This paper contains 51 sections, 19 equations, 9 figures, 19 tables.

Introduction
Background and Related Work
Supervised Generalist Agents
Offline RL
Scaling Law Analysis
Scalable Offline Actor-Critic Learning
Background and Notation
Offline KL-Regularized Actor-Critic
Scalable Architecture for Actor-Critic Learning
Observation Encoding
Transformer on Latent Space
Policy and Value Decoding
Experiments
Scaling Analysis for Offline RL Objectives
Large-scale Offline Actor-Critic Learning
...and 36 more sections

Figures (9)

Figure 1: PAC is a scalable neural architecture for continuous control able to smoothly interpolate between BC and offline RL. The system design enables training on heterogeneous, multi-modal data of varying quality. We demonstrate that our system achieves higher performance than BC across a series of model scales. The method also enables a seamless transition into offline and online RL finetuning for fast adaptation and mastery of control tasks.
Figure 2: High-level PAC model architecture. Modality-specific encoders transform proprioceptive (P), visual (V), and language (L) inputs into embedding vectors $e_I$, which are cross-attended by learnable latent queries $z_0$. This is followed by a series of self-attention blocks to yield the latent encoding $z_M$, which is then queried via additional cross-attention modules to decode the desired outputs. The policy decoder employs a learnable query $q_{\pi}$ to cross-attend $z_M$ and outputs the logits of action distributions. The Q-value decoder employs a query $q_Q$ based on the encoded actions to cross-attend $z_M$ and outputs the action-specific logits of the distributional Q-function.
Figure 3: Scaling laws based on the return profile envelope for PAC. We select 100 logarithmically spaced points between 5E+18 and 5E+20 FLOPs on the envelope of the return profiles (left) for the scaling law fits. For both the token and parameter scaling plots (middle, right), we indicate the scaling trend with a dashed red line. The green intersection represents the optimality point when training on a single epoch of our data while the teal intersection represents the optimal data and parameter trade-off for a FLOP budget of 1E+21.
Figure 4: Iso-Return comparison of BC+Q vs PAC. The return profile (top) contrasts the expected average return between the BC baseline and the RL objective across all model scales. The Iso-Return contours (bottom) depict how the reward landscape over the parameter-FLOPs plane shifts between using the BC objective (dashed contours) and the RL objectives (solid contours).
Figure 5: A selection of the domains and tasks in our data mix. Top left: Control Suite features 32 different continuous control tasks across 15 different embodiments with a great variance in proprioception and action spaces. Top right: Stacking RGB objects into different configurations (pyramids and towers) with a simulated Panda arm. Bottom left: Inserting gears onto pegs in simulation. Bottom right: Performing the RGB stacking task on a real Sawyer robot.
...and 4 more figures
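
The data flow described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the PAC implementation: it uses single-head attention without learned key/value projections, MLPs, or normalization, and every size (`d`, `n_inputs`, `n_latents`, `n_action_bins`, `n_q_bins`, the 3-dimensional action) and every weight matrix (`W_pi`, `W_a`, `W_Q`) is a hypothetical placeholder. Only the symbols $e_I$, $z_0$, $z_M$, $q_{\pi}$, and $q_Q$ come from the caption.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Single-head scaled dot-product attention: queries read from context."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context


rng = np.random.default_rng(0)
d, n_inputs, n_latents = 16, 40, 8   # assumed sizes
n_action_bins, n_q_bins = 5, 7       # assumed output sizes

e_I = rng.normal(size=(n_inputs, d))   # embeddings from the P/V/L encoders
z_0 = rng.normal(size=(n_latents, d))  # learnable latent queries

# Encode: the latents cross-attend the input embeddings.
z = cross_attend(z_0, e_I)

# Self-attention blocks on the latent space (here M = 2, with residuals).
for _ in range(2):
    z = z + cross_attend(z, z)
z_M = z

# Policy decoding: a learnable query q_pi cross-attends z_M.
q_pi = rng.normal(size=(1, d))
W_pi = rng.normal(size=(d, n_action_bins))
policy_logits = cross_attend(q_pi, z_M) @ W_pi

# Q-value decoding: the query q_Q is built from an encoded action.
action = rng.normal(size=(1, 3))       # hypothetical 3-dimensional action
W_a = rng.normal(size=(3, d))
q_Q = action @ W_a
W_Q = rng.normal(size=(d, n_q_bins))
q_logits = cross_attend(q_Q, z_M) @ W_Q

print(z_M.shape, policy_logits.shape, q_logits.shape)
```

In this sketch the self-attention blocks operate on `n_latents` tokens rather than `n_inputs`, and evaluating the Q-function for a different action only repeats the final cross-attention step while `z_M` is reused.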
