Is Value Learning Really the Main Bottleneck in Offline RL?

Seohong Park; Kevin Frans; Sergey Levine; Aviral Kumar

TL;DR

This paper investigates why offline RL often underperforms imitation learning and proposes a data-scaling framework to dissect three potential bottlenecks: value learning (B1), policy extraction (B2), and policy generalization to test-time states (B3). Through large-scale experiments that decouple value learning from policy extraction, it finds that the choice of policy extraction method and test-time generalization frequently constrain performance more than the value function itself, with gradient-based, behavior-constrained policy methods such as DDPG+BC scaling better with data. It further shows that test-time generalization bottlenecks can be mitigated by using higher-coverage offline data and by on-the-fly policy-improvement techniques (OPEX, TTT) during evaluation. These findings challenge the prevailing emphasis on improving value-function accuracy, offer practical strategies for improving offline RL in real-world settings, and point to new research directions in policy extraction and generalization.
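As a rough illustration of what an on-the-fly, value-guided action adjustment at evaluation time can look like, the sketch below nudges the policy's action along the gradient of a learned critic. This is a hedged sketch in the spirit of the test-time techniques mentioned above, not the paper's implementation: the function names, the step size `beta`, the finite-difference gradient, and the toy critic are all assumptions made for illustration.

```python
import numpy as np

def numerical_action_gradient(q_fn, state, action, eps=1e-5):
    # Central finite differences of Q(s, a) with respect to a 1-D action.
    # (An autodiff framework would normally replace this; assumption for the sketch.)
    grad = np.zeros_like(action, dtype=float)
    for i in range(action.size):
        step = np.zeros_like(action, dtype=float)
        step[i] = eps
        grad[i] = (q_fn(state, action + step) - q_fn(state, action - step)) / (2 * eps)
    return grad

def value_guided_action(q_fn, policy_fn, state, beta=0.1):
    # Start from the policy's action and take one step of size beta
    # in the direction that increases the critic's value estimate.
    action = np.asarray(policy_fn(state), dtype=float)
    return action + beta * numerical_action_gradient(q_fn, state, action)

# Hypothetical toy critic: highest value when the action equals `target`.
target = np.array([1.0, 1.0])
toy_q = lambda s, a: -np.sum((a - target) ** 2)
toy_policy = lambda s: np.zeros(2)

state = np.zeros(3)
adjusted = value_guided_action(toy_q, toy_policy, state, beta=0.1)
print(adjusted, toy_q(state, adjusted) > toy_q(state, toy_policy(state)))
```

The adjustment leaves the policy parameters untouched, so it only helps to the extent that the critic is informative around the states visited at evaluation time.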

Abstract

While imitation learning requires access to high-quality data, offline reinforcement learning (RL) should, in principle, perform similarly or better with substantially lower data quality by using a value function. However, current results indicate that offline RL often performs worse than imitation learning, and it is often unclear what holds back the performance of offline RL. Motivated by this observation, we aim to understand the bottlenecks in current offline RL algorithms. While poor performance of offline RL is typically attributed to an imperfect value function, we ask: is the main bottleneck of offline RL indeed in learning the value function, or something else? To answer this question, we perform a systematic empirical study of (1) value learning, (2) policy extraction, and (3) policy generalization in offline RL problems, analyzing how these components affect performance. We make two surprising observations. First, we find that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL, often more so than the value learning objective. For instance, we show that common value-weighted behavioral cloning objectives (e.g., AWR) do not fully leverage the learned value function, and switching to behavior-constrained policy gradient objectives (e.g., DDPG+BC) often leads to substantial improvements in performance and scalability. Second, we find that a big barrier to improving offline RL performance is often imperfect policy generalization on test-time states out of the support of the training data, rather than policy learning on in-distribution states. We then show that the use of suboptimal but high-coverage data or test-time policy training techniques can address this generalization issue in practice. Specifically, we propose two simple test-time policy improvement methods and show that these methods lead to better performance.
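To make the contrast between the two policy extraction families concrete, here is a minimal NumPy sketch of how the two losses are commonly written. It is not the authors' code: it assumes a unit-variance Gaussian policy, and the function names, the `max_weight` clip, and the toy critic are hypothetical choices for illustration. The sketch only evaluates the losses; in practice they would be minimized with an autodiff framework.

```python
import numpy as np

def gaussian_log_prob(actions, means):
    # Log-density of a unit-variance Gaussian policy, dropping the constant term.
    return -0.5 * np.sum((actions - means) ** 2, axis=-1)

def awr_loss(q_sa, v_s, actions, means, alpha=1.0, max_weight=100.0):
    # Value-weighted behavioral cloning: dataset actions are re-weighted by
    # exp(alpha * advantage), so the critic enters only through scalar weights.
    # Clipping the weights is a common practical choice (assumption here).
    weights = np.minimum(np.exp(alpha * (q_sa - v_s)), max_weight)
    return -np.mean(weights * gaussian_log_prob(actions, means))

def ddpg_bc_loss(q_fn, states, actions, means, alpha=1.0):
    # Behavior-constrained policy gradient: the critic is evaluated at the
    # policy's own action, plus a BC term whose strength is set by alpha.
    q_pi = q_fn(states, means)
    return -np.mean(q_pi + alpha * gaussian_log_prob(actions, means))

# Hypothetical toy batch: two states, one-dimensional actions.
states = np.zeros((2, 3))
actions = np.zeros((2, 1))        # dataset actions
means = np.ones((2, 1))           # current policy outputs
toy_q = lambda s, a: -np.sum(a ** 2, axis=-1)

print(awr_loss(np.array([1.0, 0.0]), np.zeros(2), actions, means))
print(ddpg_bc_loss(toy_q, states, actions, means))
```

In the first loss the policy is only ever pushed toward actions that appear in the dataset, while in the second the policy's own action is passed through the critic, which is one way to read the abstract's remark that value-weighted behavioral cloning does not fully leverage the learned value function.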

Paper Structure (32 sections, 10 equations, 12 figures, 4 tables)

This paper contains 32 sections, 10 equations, 12 figures, 4 tables.

Introduction
Related work
Main hypothesis
Preliminaries
Empirical analysis 1: Is it the value or the policy? (B1 and B2)
Analysis setup
Value learning objectives
Policy extraction objectives
Environments and datasets
Results: Policy extraction mechanisms substantially affect data-scaling trends
Deep dive 1: How different are the scaling properties of AWR and DDPG+BC?
Deep dive 2: Why is DDPG+BC better than AWR?
Empirical analysis 2: Policy generalization (B3)
Analysis setup
Results: Test-time generalization is often the main bottleneck in offline RL
...and 17 more sections

Figures (12)

Figure 1: Data-scaling matrices of three policy extraction methods (AWR, DDPG+BC, and SfBC) and three value learning methods (IQL and either SARSA or CRL). To see whether the value or the policy imposes a bigger bottleneck, we measure performance with varying amounts of data for the value and the policy. The color gradients of these matrices reveal how the performance of offline RL is bottlenecked in each setting.
Figure 2: Data-scaling matrices of AWR and DDPG+BC with different BC strengths ($\alpha$). In gc-antmaze-large, AWR is always policy-bounded, but DDPG+BC has both policy-bounded and value-bounded modes, depending on the value of $\alpha$. Notably, an in-between value of $\alpha = 1.0$ in DDPG+BC leads to the best of both worlds (see the bottom left corner of gc-antmaze-large with $0.1$M datasets).
Figure 3: AWR vs. DDPG actions.
Figure 4: AWR overfits.
Figure 5: Three distributions for the MSE metrics.
...and 7 more figures
