Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

Jie Cheng; Ruixi Qiao; Yingwei Ma; Binhua Li; Gang Xiong; Qinghai Miao; Yongbin Li; Yisheng Lv

TL;DR

This paper introduces JOWA, a single offline model-based RL agent that jointly optimizes a world model and a Q-value critic through a shared transformer backbone, pretrained on diverse Atari data to learn general representations and decision-making. A planning module runs a parallelizable search over imagined trajectories to compensate for Q-value estimation errors, supporting robust inference and transfer. Results show that scaling model capacity yields performance gains: JOWA-150M achieves 78.9% IQM HNS on 15 pretrained games using only 10% of the data, and 64.7% IQM DNS when fine-tuned on 5 unseen games with ~5k transitions per game. The approach demonstrates strong cross-task generalization and sample-efficient transfer, highlighting the value of joint world-action pretraining and planning in offline multi-task RL.
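To make the headline metric concrete, the sketch below computes an interquartile mean (IQM) over human-normalized scores. It assumes HNS denotes the human-normalized score commonly used for Atari, (agent - random) / (human - random), and that IQM is the mean of the middle 50% of values. The game names and raw scores are hypothetical placeholders, not numbers from the paper, and the paper's exact aggregation (for example over runs as well as games) may differ.

```python
from math import floor


def human_normalized_score(agent, random_score, human):
    """HNS = (agent - random) / (human - random)."""
    return (agent - random_score) / (human - random_score)


def interquartile_mean(values):
    """Mean of the middle 50%: drop the lowest and highest 25% of values."""
    ordered = sorted(values)
    cut = floor(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Hypothetical (agent, random, human) raw scores; NOT taken from the paper.
raw_scores = {
    "game_a": (50.0, 0.0, 100.0),
    "game_b": (30.0, 10.0, 50.0),
    "game_c": (400.0, 0.0, 200.0),
    "game_d": (5.0, 5.0, 105.0),
}

hns = [human_normalized_score(*scores) for scores in raw_scores.values()]
iqm_hns = interquartile_mean(hns)
print(f"IQM HNS = {iqm_hns:.1%}")  # prints "IQM HNS = 50.0%"
```

Trimming the extremes keeps one outlier game (here the hypothetical `game_c` at 200% HNS) from dominating the aggregate, which is the usual motivation for reporting IQM rather than a plain mean.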

Abstract

A significant aspiration of offline reinforcement learning (RL) is to develop a generalist agent with high capabilities from large and heterogeneous datasets. However, prior approaches that scale offline RL either rely heavily on expert trajectories or struggle to generalize to diverse unseen tasks. Inspired by the excellent generalization of world models in conditional video generation, we explore the potential of image observation-based world models for scaling offline RL and enhancing generalization on novel tasks. In this paper, we introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on multiple Atari games with 6 billion tokens of data to learn general-purpose representation and decision-making ability. Our method jointly optimizes a world-action model through a shared transformer backbone, which stabilizes temporal difference learning with large models during pretraining. Moreover, we propose a provably efficient and parallelizable planning algorithm to compensate for the Q-value estimation error and thus search out better policies. Experimental results indicate that our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on average. Furthermore, JOWA scales favorably with model capacity and can sample-efficiently transfer to novel games using only 5k offline fine-tuning data (approximately 4 trajectories) per game, demonstrating superior generalization. We will release code and model weights at https://github.com/CJReinforce/JOWA
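The idea of planning over imagined trajectories to offset Q-value estimation error can be illustrated with a generic beam search. This is a minimal sketch under stated assumptions, not the paper's algorithm: `world_model`, `q_estimate`, the line-world environment, and the scoring rule (imagined discounted return plus a bootstrapped Q-value at the leaf) are all hypothetical stand-ins chosen for illustration.

```python
GAMMA = 0.9
ACTIONS = (-1, +1)  # move left / move right on a line of states 0..3


def world_model(state, action):
    """Toy stand-in for learned dynamics and reward: reward 1 on first reaching state 3."""
    next_state = max(0, min(3, state + action))
    reward = 1.0 if next_state == 3 and state != 3 else 0.0
    return next_state, reward


def q_estimate(state, action):
    """Deliberately flawed critic: at state 0 it wrongly prefers moving left."""
    if state == 3:
        return 0.0  # goal state treated as terminal
    if state == 0:
        return 0.5 if action == -1 else 0.4
    return 0.9 if action == +1 else 0.1


def plan(state, horizon=3, beam_width=2):
    """Beam search over imagined rollouts; returns the first action of the best beam."""
    # Each beam entry: (first_action, current_state, discounted_return_so_far)
    beams = [(None, state, 0.0)]
    for depth in range(horizon):
        candidates = []
        for first, s, ret in beams:
            for a in ACTIONS:
                s2, r = world_model(s, a)
                new_ret = ret + (GAMMA ** depth) * r
                # Score = imagined return so far + bootstrapped value at the leaf.
                leaf_value = max(q_estimate(s2, b) for b in ACTIONS)
                score = new_ret + (GAMMA ** (depth + 1)) * leaf_value
                candidates.append((score, a if first is None else first, s2, new_ret))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [(f, s2, ret) for _, f, s2, ret in candidates[:beam_width]]
    return beams[0][0]


greedy_action = max(ACTIONS, key=lambda a: q_estimate(0, a))
planned_action = plan(0)
print(greedy_action, planned_action)  # prints "-1 1"
```

In this toy setting the greedy policy follows the flawed critic and moves left, while the lookahead uses imagined rewards to recover the better first action. The expansions within each depth are independent of one another, which is what makes this style of search straightforward to batch in parallel.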

Paper Structure (50 sections, 1 theorem, 27 equations, 2 figures, 21 tables)

Introduction
Related work
Offline Reinforcement Learning.
Multi-Task Reinforcement Learning.
Preliminaries and problem setup
Online distributional RL (C51)
Value regularization based offline RL (CQL)
Problem setup
Jointly-Optimized World-Action Model
World-Action Model
Architecture
Training of World-part Module
Training of Action-part Module
Parallelizable planning at inference time
Training pipeline
...and 35 more sections

Key Result

Theorem C.2

Define $s_t,a_t$ to be the states and actions resulting from current policy using ground-truth dynamics $P$ and reward function $r$ and similarly define $s_t^{'},a_t^{'}$ using learned functions $\hat{P}$ and $\hat{r}$. Assume the learned reward function $\hat{r}$ to be $L_r$-Lipschitz and the estim

Figures (2)

Figure 1: Architecture of JOWA. We use a shared transformer backbone for both world modeling and Q-value criticism to enable joint optimization. VQ-VAE tokenizes images into visual tokens. The sum of vocabulary embeddings, position embeddings and task embeddings forms the input embeddings space for the transformer backbone.
Figure 2: Scaling trends for different algorithms on the training set games.
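
The input construction described in the Figure 1 caption (vocabulary, position, and task embeddings summed into one input vector per token) can be sketched as follows. This is an illustrative sketch only: the table sizes, the random initialization, and the helper names `make_table` and `embed` are assumptions, and the token ids merely stand in for VQ-VAE code indices.

```python
import random


def make_table(rows, dim, rng):
    """Hypothetical embedding table: `rows` vectors of length `dim`."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, task_id, vocab_emb, pos_emb, task_emb):
    """Return one input vector per token: vocabulary + position + task embedding."""
    task_vec = task_emb[task_id]
    return [
        [v + p + t for v, p, t in zip(vocab_emb[tok], pos_emb[pos], task_vec)]
        for pos, tok in enumerate(token_ids)
    ]


rng = random.Random(0)
DIM, VOCAB, MAX_LEN, NUM_TASKS = 8, 16, 4, 3  # hypothetical sizes
vocab_emb = make_table(VOCAB, DIM, rng)
pos_emb = make_table(MAX_LEN, DIM, rng)
task_emb = make_table(NUM_TASKS, DIM, rng)

tokens = [5, 2, 9, 2]  # stand-ins for VQ-VAE visual token ids
inputs = embed(tokens, task_id=1, vocab_emb=vocab_emb, pos_emb=pos_emb, task_emb=task_emb)
print(len(inputs), len(inputs[0]))  # prints "4 8"
```

Because the task embedding is added to every position, the same visual token yields different inputs for different games, which is one plausible way for a shared backbone to tell tasks apart.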

Theorems & Definitions (2)

Theorem C.2
proof
