Scaling Offline RL via Efficient and Expressive Shortcut Models

Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kiante Brantley, Wen Sun

TL;DR

This paper tackles scaling offline reinforcement learning by introducing Scalable Offline Reinforcement Learning (SORL), a unified, one-stage training approach that uses shortcut models to achieve expressive policies with budget-flexible inference. It uses self-consistency to allow different numbers of denoising steps across training, regularization, and inference, which enables sequential and parallel test-time scaling with a learned $Q$-function as a verifier. Theoretical results show that the training objective regularizes the learned policy toward the behavior policy in 2-Wasserstein distance ($W_2$), which limits distributional shift away from the offline data. Experiments on the OGBench suite demonstrate strong performance across diverse tasks and clear gains from additional test-time compute. The work advances practical offline RL by coupling expressive, diffusion-inspired modeling with efficient, scalable inference, offering a path toward deploying offline policies in real-world settings with varying compute budgets.
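To make the self-consistency idea concrete, here is a minimal NumPy sketch. It assumes the usual shortcut-model convention that one step of size $2h$ should agree with two consecutive steps of size $h$; the summary above does not spell out SORL's exact loss, so treat this as an illustration rather than the paper's objective. The names `toy_shortcut`, `self_consistency_target`, and `self_consistency_loss` are hypothetical, and conditioning on the state is omitted for brevity.

```python
import numpy as np

def toy_shortcut(z, t, h):
    """Hypothetical stand-in for a learned shortcut model s(z, t, h).

    It returns the exact average velocity of the ODE dz/dt = -z over
    [t, t + h], whose flow is z(t + h) = z(t) * exp(-h).
    """
    return z * (np.exp(-h) - 1.0) / h

def self_consistency_target(s, z, t, h):
    """Average of two consecutive h-sized steps, used as the target for s(z, t, 2h).

    In a real training loop this target would typically be treated as a
    constant (no gradient flows through it); that is an assumption here.
    """
    v1 = s(z, t, h)
    z_mid = z + h * v1          # state after the first small step
    v2 = s(z_mid, t + h, h)
    return 0.5 * (v1 + v2)

def self_consistency_loss(s, z, t, h):
    """Mean squared gap between one 2h-sized step and two h-sized steps."""
    target = self_consistency_target(s, z, t, h)
    pred = s(z, t, 2.0 * h)
    return float(np.mean((pred - target) ** 2))
```

The toy model is self-consistent by construction, so its loss is zero up to floating-point error, whereas a model that ignores the step size (for example `lambda z, t, h: -z`) incurs a positive loss. A model with a small self-consistency loss can be queried with different step sizes at inference time, which is what makes the inference budget flexible.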

Abstract

Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse and multimodal behavior. However, applying these models to offline reinforcement learning (RL) remains challenging due to the iterative nature of their noise sampling processes, making policy optimization difficult. In this paper, we introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models - a novel class of generative models - to scale both training and inference. SORL's policy can capture complex data distributions and can be trained simply and efficiently in a one-stage training procedure. At test time, SORL introduces both sequential and parallel inference scaling by using the learned Q-function as a verifier. We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute. We release the code at nico-espinosadice.github.io/projects/sorl.
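The two forms of test-time scaling mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: sequential scaling is modeled as Euler integration with step size $h = 1/M^{\text{inf}}$, and parallel scaling as best-of-$N$ selection scored by a $Q$-function. The functions `toy_shortcut` and `toy_q`, their signatures, and the argument names are all hypothetical.

```python
import numpy as np

def toy_shortcut(z, state, t, h):
    """Hypothetical shortcut model: a velocity that pulls z toward the state."""
    return state - z

def toy_q(state, action):
    """Hypothetical Q-function: prefers actions close to the state vector."""
    return -float(np.sum((action - state) ** 2))

def sample_action(shortcut, state, noise, num_steps):
    """Sequential scaling: map noise to an action with num_steps Euler
    steps of size h = 1 / num_steps."""
    h = 1.0 / num_steps
    z = noise
    for k in range(num_steps):
        z = z + h * shortcut(z, state, k * h, h)
    return z

def best_of_n(shortcut, q_fn, state, action_dim, num_samples, num_steps, rng):
    """Parallel scaling: sample num_samples candidate actions and keep the
    one that the Q-function (acting as a verifier) scores highest."""
    noises = rng.standard_normal((num_samples, action_dim))
    candidates = np.stack(
        [sample_action(shortcut, state, noise, num_steps) for noise in noises]
    )
    scores = np.array([q_fn(state, a) for a in candidates])
    return candidates[int(np.argmax(scores))]

# Usage: 8 candidates, 4 inference steps each.
state = np.array([0.5, -0.5])
action = best_of_n(toy_shortcut, toy_q, state, action_dim=2,
                   num_samples=8, num_steps=4, rng=np.random.default_rng(0))
```

Both knobs are independent: `num_steps` trades compute for a finer discretization of a single sample, while `num_samples` spends compute on more candidates for the verifier to choose from.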

Paper Structure

This paper contains 65 sections, 5 theorems, 58 equations, 5 figures, 5 tables, 2 algorithms.

Key Result

Theorem 2

Suppose the shortcut model $s(z,t,h)$ is $L$-Lipschitz in $z$ for all $t$ and $h$, the drift function $v_t(z)$ is $L_v$-Lipschitz in $z$ for all $t$, $\sup_t\mathbb{E}_{z_t\sim p_t}\left[\left\|v_t\right\|_2^2\right] \le M_v$, and $L/M < 1$. If Assumption \ref{assum:fm-cl-err} holds, then for all $h = \ldots$, where $\hat{p}^{(h)}$ is the distribution of samples generated by the shortcut model with step size $h$.

Figures (5)

  • Figure 1: SORL's Sequential Scaling. For a fixed training budget, SORL generally improves performance with greater test-time compute. We fix a training budget of discretization steps and backpropagation steps through time ($M^{\text{disc}} = M^{\text{BTT}} = 8$) and vary the inference budget via the number of inference steps $M^{\text{inf}}$. The performance is averaged over 8 seeds for each task, with 5 tasks per environment, and standard deviations reported.
  • Figure 2: SORL's Parallel Scaling. SORL generalizes to new inference steps at test time, beyond what was optimized through backpropagation during training. For each fixed training budget (i.e. fixed number of discretization steps $M^{\text{disc}}$ and backpropagation through time steps $M^{\text{BTT}}$), we evaluate with varying inference steps $M^{\text{inf}}$. $M^{\text{BTT}}$ denotes the maximum number of steps used for backpropagation through time in the $Q$ update. The $\star$ hatch denotes best-of-$N$ sampling, with $N=8$, where the number of inference steps is greater than the number of backpropagation steps through time (i.e. $M^{\text{inf}} > M^{\text{BTT}}$). $\texttt{SORL}^{\star}$ denotes the best performance achieved by SORL in Table \ref{table:offline_table_envs}. Results are averaged over 8 seeds for each of the 5 tasks.
  • Figure 3: Runtime Comparison. We vary SORL's training-time compute budget (i.e. the number of backpropagation steps through time $M^{\text{BTT}}$) on the left and SORL's inference-time compute budget (i.e. the number of inference steps $M^{\text{inf}}$) on the right. The performance is averaged over 5 seeds for each task, with 5 tasks per environment, and standard deviations reported.
  • Figure 4: Ablation Over Backpropagation Steps Through Time, $M^{\text{BTT}}$. We investigate the effect of varying the training-time compute budget (i.e. the number of backpropagation steps through time $M^{\text{BTT}}$). The performance is averaged over 8 seeds for each task, with 5 tasks per environment, and standard deviations reported. We report results using one inference step (i.e. $M^{\text{inf}}=1$) and using the same number of inference steps as backpropagation steps through time (i.e. $M^{\text{inf}}=M^{\text{BTT}}$).
  • Figure 5: Ablation Over Policy Network Size. The performance is averaged over 5 seeds for each task, with 5 tasks per environment, and standard deviations reported. We use the same training-time and inference-time compute budgets for SORL as we use for Tables \ref{table:offline_table_envs} and \ref{table:offline_table_tasks} (i.e. $M^{\text{BTT}}=8$ and $M^{\text{inf}}=4$). We train and evaluate FQL with the parameters used in the official implementation \cite{park2025flow}. The only change to SORL and FQL is varying the sizes of the policy network's hidden layers.

Theorems & Definitions (5)

  • Theorem 2: Regularization To Behavior Policy
  • Theorem 3: Restatement of Theorem \ref{thm:shortcut-conv} With Explicit Dependency on $h$
  • Lemma 4: Single-Step Error with Minimum Step-Size
  • Lemma 5: Single-Step Error with Step Size $h$
  • Lemma 6: Error of $1/h$-Step Inference