Scaling Offline RL via Efficient and Expressive Shortcut Models
Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kianté Brantley, Wen Sun
TL;DR
This paper addresses the problem of scaling offline reinforcement learning by introducing Scalable Offline Reinforcement Learning (SORL), a one-stage training approach that uses shortcut models to obtain expressive policies whose inference cost can be adjusted to the available compute budget. SORL relies on the self-consistency property of shortcut models, which allows different numbers of denoising steps to be used for training, regularization, and inference. This in turn enables both sequential and parallel test-time scaling, with the learned $Q$-function acting as a verifier. Theoretical results show that the training objective regularizes the learned policy toward the behavior policy in the Wasserstein distance $W_2$, which limits distribution shift away from the offline data. Experiments on the OGBench suite show strong performance across diverse tasks and clear gains from additional test-time compute. By coupling expressive, diffusion-inspired policies with efficient, scalable inference, the work offers a path toward deploying offline RL policies in settings with varying compute budgets.
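For reference, the standard definition of the 2-Wasserstein distance between two action distributions at a state $s$ is given below. The symbols $\pi$ (learned policy), $\pi_B$ (behavior policy), and $\Gamma$ (set of couplings) are notation assumed here for illustration and may differ from the paper's own.

$$
W_2\big(\pi(\cdot \mid s),\, \pi_B(\cdot \mid s)\big) = \left( \inf_{\gamma \in \Gamma(\pi(\cdot \mid s),\, \pi_B(\cdot \mid s))} \mathbb{E}_{(a, a') \sim \gamma} \big[ \lVert a - a' \rVert_2^2 \big] \right)^{1/2}
$$

Here $\Gamma(\mu, \nu)$ denotes the set of joint distributions whose marginals are $\mu$ and $\nu$. A small $W_2$ means that actions sampled from the learned policy can be matched, on average, to nearby actions from the behavior policy.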
Abstract
Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse and multimodal behavior. However, applying these models to offline reinforcement learning (RL) remains challenging because their noise sampling processes are iterative, which makes policy optimization difficult. In this paper, we introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models, a novel class of generative models, to scale both training and inference. SORL's policy can capture complex data distributions and can be trained simply and efficiently in a one-stage training procedure. At test time, SORL introduces both sequential and parallel inference scaling by using the learned Q-function as a verifier. We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute. We release the code at nico-espinosadice.github.io/projects/sorl.
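The two forms of test-time scaling mentioned above can be illustrated with a minimal Python sketch. It is not the released implementation. It assumes the policy is a velocity field conditioned on the current time and step size, so that more Euler steps give sequential scaling, and that parallel scaling is best-of-N selection under the Q-function. All names (`sample_action`, `best_of_n`, `toy_velocity`, `toy_q`) are hypothetical, and the toy velocity field and Q-function are stand-ins for learned networks.

```python
import random


def sample_action(velocity_fn, state, num_steps, action_dim, rng):
    """Sequential scaling: integrate noise into an action with num_steps Euler steps."""
    d = 1.0 / num_steps  # step size passed to the (hypothetical) shortcut policy
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]  # start from Gaussian noise
    for k in range(num_steps):
        t = k * d
        v = velocity_fn(state, x, t, d)
        x = [xi + d * vi for xi, vi in zip(x, v)]
    return x


def best_of_n(velocity_fn, q_fn, state, num_samples, num_steps, action_dim, seed=0):
    """Parallel scaling: draw num_samples actions and keep the one with the highest Q-value."""
    rng = random.Random(seed)
    candidates = [
        sample_action(velocity_fn, state, num_steps, action_dim, rng)
        for _ in range(num_samples)
    ]
    return max(candidates, key=lambda a: q_fn(state, a))


def toy_velocity(state, x, t, d):
    """Toy stand-in for a learned policy: flow toward +1 or -1 based on the sign of x[0]."""
    mode = 1.0 if x[0] >= 0.0 else -1.0
    return [(mode - xi) / max(1.0 - t, 1e-8) for xi in x]


def toy_q(state, action):
    """Toy stand-in for a learned Q-function: prefers actions near +1 in every dimension."""
    return -sum((ai - 1.0) ** 2 for ai in action)


state = None  # the toy functions ignore the state
single = best_of_n(toy_velocity, toy_q, state, num_samples=1, num_steps=4, action_dim=2)
best = best_of_n(toy_velocity, toy_q, state, num_samples=8, num_steps=4, action_dim=2)
```

Because both calls use the same seed, the first candidate is identical in each, so the action chosen from eight samples never scores lower under `toy_q` than the single sample. In this sketch, `num_steps` controls the sequential budget and `num_samples` the parallel budget.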
