Expressive Value Learning for Scalable Offline Reinforcement Learning

Nicolas Espinosa-Dice, Kiante Brantley, Wen Sun

TL;DR

EVOR tackles the scalability challenge of offline reinforcement learning for robotics by jointly enabling expressive policies and expressive value functions through flow matching. It avoids policy distillation and backpropagation through time by performing inference-time policy extraction via rejection sampling guided by an optimal, regularized Q-function derived from a learned reward-to-go distribution. The method leverages flow-based TD learning to model the reward-to-go distribution and computes $Q^{\star}$ for robust action selection at test time, with test-time regularization and search depth controlled by simple hyperparameters. Empirically, EVOR achieves superior performance across diverse OGBench tasks and demonstrates that increasing inference-time compute improves results up to a saturation point, validating the value of expressive value learning for scalable offline RL in robotics.

Abstract

Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. However, RL has yet to be fully leveraged in robotics, principally due to its lack of scalability. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically depend on either backpropagation through time (BPTT), which is computationally prohibitive, or policy distillation, which introduces compounding errors and limits scalability to larger base policies. In this paper, we consider the question of how to develop a scalable offline RL approach without relying on distillation or backpropagation through time. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR): a scalable offline RL approach that integrates both expressive policies and expressive value functions. EVOR learns an optimal, regularized Q-function via flow matching during training. At inference time, EVOR extracts the policy via rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining. Empirically, we show that EVOR outperforms baselines on a diverse set of offline RL tasks, demonstrating the benefit of integrating expressive value learning into offline RL.
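The inference-time extraction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in softmax-weighted resampling over a fixed set of candidates as a simple stand-in for rejection sampling, and every name here (`select_action`, `sample_action`, `q_fn`, `n_candidates`, `tau_q`, and the toy policy and Q-function) is hypothetical. The base policy and Q-function would be learned flow models in practice; here they are toy callables.

```python
import math
import random

def softmax_weights(q_values, tau_q):
    """Softmax over Q-values at temperature tau_q (shifted by the max for stability)."""
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / tau_q) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(state, sample_action, q_fn, n_candidates, tau_q, rng):
    """Draw n_candidates actions from the base policy, score them with q_fn,
    and resample one candidate with probability proportional to exp(Q / tau_q).

    Assumption: this resampling rule stands in for the paper's rejection
    sampling, whose exact acceptance rule is not given on this page.
    """
    candidates = [sample_action(state, rng) for _ in range(n_candidates)]
    q_values = [q_fn(state, a) for a in candidates]
    weights = softmax_weights(q_values, tau_q)
    return rng.choices(candidates, weights=weights, k=1)[0]

# Toy stand-ins (hypothetical): 1-D actions, with Q peaking at action = 1.0.
def toy_base_policy(state, rng):
    return rng.gauss(0.0, 1.0)

def toy_q(state, action):
    return -(action - 1.0) ** 2

rng = random.Random(0)
# A small tau_q makes selection nearly greedy over the sampled candidates.
near_greedy = select_action(None, toy_base_policy, toy_q,
                            n_candidates=32, tau_q=1e-3, rng=rng)
# With a single candidate, the result is just a base-policy sample.
base_sample = select_action(None, toy_base_policy, toy_q,
                            n_candidates=1, tau_q=1.0, rng=rng)
```

In this sketch, raising `n_candidates` spends more inference-time compute on search, while `tau_q` interpolates between near-greedy selection (small values) and the base policy's own sampling distribution (large values, where the weights become nearly uniform).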

Paper Structure

This paper contains 47 sections, 1 theorem, 36 equations, 5 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

Under deterministic transitions, the optimal value and $Q$-functions are given by

Figures (5)

  • Figure 1: EVOR's Inference-Time Scaling. EVOR can perform inference-time scaling by increasing the number of action candidates ${N_{\pi}}$, performing greater search at inference time with the expressive value function. Leveraging greater inference-time compute results in better performance, up to a saturation point. Results are averaged over three seeds per task, with standard deviations reported.
  • Figure 2: Ablation Over EVOR's Evaluation Parameters. EVOR uses the same training parameters for all environments in this paper. However, we investigate the effect of varying the temperature parameters $\tau_R$ and $\tau_Q$ at inference time on the performance of EVOR. As $\tau_Q$ decreases, the action selection becomes more greedy, while as $\tau_Q$ increases, the action selection becomes more regularized. When $\tau_Q$ is set to a high value, EVOR becomes equivalent to the base policy (i.e., the performance with ${N_{\pi}}=1$). Results are averaged over three seeds per task, with standard deviations reported.
  • Figure 3: Ablation Over Number of Action Candidates ${N_{\pi}}$. Results are averaged over three seeds per task, with standard deviations reported.
  • Figure 4: Ablation Over Reward-To-Go Temperature Parameter $\tau_R$. Results are averaged over three seeds per task, with standard deviations reported.
  • Figure 5: Ablation Over $Q^\star$ Temperature Parameter $\tau_Q$. Results are averaged over three seeds per task, with standard deviations reported.

Theorems & Definitions (1)

  • Theorem 1: Optimal Regularized Value Functions [zhou2025q]
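The statement of Theorem 1 is truncated on this page, so its exact form is not reproduced here. For intuition about how a temperature can turn samples from a reward-to-go distribution into a regularized value, the sketch below assumes the log-mean-exp form that is standard in KL-regularized RL, $\tau \log \mathbb{E}[\exp(R/\tau)]$. That form, the function name `soft_value`, and the sample values are assumptions for illustration, not text taken from the theorem.

```python
import math

def soft_value(returns, tau):
    """Log-mean-exp of sampled returns: tau * log(mean(exp(r / tau))).

    Assumption: this is the standard KL-regularized aggregation, used here
    only to illustrate the role of a temperature; it is not quoted from
    Theorem 1. The max is subtracted before exponentiating for stability.
    """
    r_max = max(returns)
    mean_exp = sum(math.exp((r - r_max) / tau) for r in returns) / len(returns)
    return r_max + tau * math.log(mean_exp)

# Hypothetical reward-to-go samples for one state-action pair.
samples = [0.0, 1.0, 4.0]

low_temp = soft_value(samples, tau=1e-3)   # close to max(samples)
high_temp = soft_value(samples, tau=1e3)   # close to the mean of samples
```

Under this assumed form, a small temperature makes the value track the best sampled return, while a large temperature makes it track the average return under the sampling distribution.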