Horizon Reduction Makes RL Scalable

Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, Sergey Levine

TL;DR

This work investigates the scalability of offline RL on long-horizon tasks by scaling datasets up to 1B transitions per environment, and identifies horizon length as a fundamental bottleneck. It shows that standard offline RL methods scale poorly even with this much data, and uses controlled experiments to show that, as the horizon grows, value learning accumulates bias and policy learning becomes more complex. The authors study horizon-reduction techniques, including $n$-step value updates and hierarchical policy structures, and propose SHARSA, a minimal two-level horizon-reduction method based on flow behavioral cloning and SARSA. Across challenging OGBench tasks, horizon-reduction methods improve both scaling behavior and asymptotic performance, with SHARSA delivering the strongest results, although fully solving long-horizon offline RL at scale remains an open question. The work advocates for scalability-focused evaluation in offline RL and releases datasets and code to spur future advances.

Abstract

In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000x larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL. Code: https://github.com/seohongpark/horizon-reduction
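To make one of the horizon reduction techniques named in the TL;DR concrete, the following is a minimal sketch of an $n$-step value target. It is not taken from the authors' code: the function names `n_step_target` and `num_bootstrap_steps`, the discount `gamma`, and the example values are assumptions chosen for illustration. The sketch shows the standard $n$-step target $\sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V(s_{t+n})$ and how using $n$ real rewards per backup reduces the number of chained bootstrapped updates needed to span a horizon of length $H$.

```python
def n_step_target(rewards, bootstrap_value, gamma, n):
    """Compute sum_{i<n} gamma^i * rewards[i] + gamma^n * bootstrap_value.

    `rewards` holds at least n rewards observed from the current step onward;
    `bootstrap_value` is the (hypothetical) value estimate n steps ahead.
    """
    if n < 1 or len(rewards) < n:
        raise ValueError("need n >= 1 and at least n rewards")
    target = bootstrap_value
    # Fold the rewards in from the last of the n steps back to the first.
    for r in reversed(rewards[:n]):
        target = r + gamma * target
    return target


def num_bootstrap_steps(horizon, n):
    """Chained bootstrapped backups needed to cover `horizon` steps (ceiling of horizon / n)."""
    return -(-horizon // n)


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    print(n_step_target(rewards, bootstrap_value=10.0, gamma=0.5, n=1))
    print(n_step_target(rewards, bootstrap_value=10.0, gamma=0.5, n=3))
    # With H = 512, an illustrative n = 64 shortens the backup chain considerably.
    print(num_bootstrap_steps(512, 1), num_bootstrap_steps(512, 64))
```

With $n = 1$ this reduces to the usual 1-step TD target. Larger $n$ leans more on rewards observed in the dataset and less on the learned value estimate, so errors in that estimate are compounded over fewer backups.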

Paper Structure

This paper contains 24 sections, 21 equations, 51 figures, 6 tables, 2 algorithms.

Figures (51)

  • Figure 1: Horizon reduction makes RL scalable. Standard offline RL methods struggle to scale on highly challenging tasks, not improving performance with more data. We show that this is mainly because the long horizon can fundamentally inhibit scaling, and that horizon reduction techniques unlock the scaling of offline RL.
  • Figure 2: Standard offline RL methods struggle to scale on challenging tasks. We train four offline RL methods with 1M, 10M, 100M, and 1B transitions on four complex, long-horizon tasks. However, even with 1B transitions, their performance often saturates far below the maximum performance (100%).
  • Figure 3: Increasing model capacity alone is not sufficient to master the tasks.
  • Figure 4: The combination-lock task with $H = 512$ states.
  • Figure 5: $1$-step TD learning suffers from bias accumulation (i.e., high Q errors).
  • ...and 46 more figures