Scalable Offline Model-Based RL with Action Chunks

Kwanyoung Park, Seohong Park, Youngwoon Lee, Sergey Levine

TL;DR

The paper tackles the challenge of scaling offline model-based reinforcement learning (MB-RL) to long-horizon tasks by introducing Model-Based RL with Action Chunks (MAC). MAC combines a multi-step action-chunk dynamics model with a flow-based action-chunk policy and rejection sampling, which lets it generate long, in-distribution imaginary rollouts while mitigating compounding model errors. On datasets of up to 100M transitions, MAC achieves state-of-the-art performance among offline MB-RL methods on challenging long-horizon manipulation tasks, and ablations confirm the importance of action-chunk length, flow rejection sampling, and distillation. Limitations remain in contact-rich locomotion domains, which suggests future work on more expressive dynamics models. Overall, the method offers a scalable recipe for long-horizon offline RL.

Abstract

In this paper, we study whether model-based reinforcement learning (RL), in particular model-based value expansion, can provide a scalable recipe for tackling complex, long-horizon tasks in offline RL. Model-based value expansion fits an on-policy value function using length-n imaginary rollouts generated by the current policy and a learned dynamics model. While larger n reduces bias in value bootstrapping, it amplifies accumulated model errors over long horizons, degrading future predictions. We address this trade-off with an action-chunk model that predicts a future state from a sequence of actions (an "action chunk") instead of a single action, which reduces compounding errors. In addition, instead of directly training a policy to maximize rewards, we employ rejection sampling from an expressive behavioral action-chunk policy, which prevents model exploitation from out-of-distribution actions. We call this recipe Model-Based RL with Action Chunks (MAC). Through experiments on highly challenging tasks with large-scale datasets of up to 100M transitions, we show that MAC achieves the best performance among offline model-based RL algorithms, especially on challenging long-horizon tasks.
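
To make the trade-off in the abstract concrete, here is a minimal sketch of a length-n value-expansion target computed with an action-chunk model. It is not the paper's implementation. The function name `chunked_value_target`, the toy policy and dynamics, and the convention that the model returns the discounted reward sum inside each chunk are all assumptions made for illustration. The point is that a chunk of length h covers n imagined steps with only n/h model calls, so there are fewer opportunities for one-step errors to compound.

```python
from typing import Callable, List, Tuple

Chunk = List[float]  # toy stand-in: a chunk is a list of h scalar actions


def chunked_value_target(
    state: float,
    policy: Callable[[float], Chunk],
    model: Callable[[float, Chunk], Tuple[float, float]],
    value: Callable[[float], float],
    gamma: float,
    chunk_len: int,
    num_chunks: int,
) -> float:
    """Value-expansion target over n = chunk_len * num_chunks imagined steps.

    Assumption: model(state, chunk) returns the state reached after
    chunk_len steps and the discounted reward sum inside that chunk.
    """
    target, discount = 0.0, 1.0
    for _ in range(num_chunks):
        chunk = policy(state)
        state, chunk_return = model(state, chunk)  # one model call per chunk
        target += discount * chunk_return
        discount *= gamma ** chunk_len
    return target + discount * value(state)  # bootstrap at the final state


GAMMA = 0.5


def toy_policy(state: float) -> Chunk:
    # Hypothetical policy: always proposes the same two-step chunk.
    return [1.0, 1.0]


def toy_model(state: float, chunk: Chunk) -> Tuple[float, float]:
    # Hypothetical dynamics: each action shifts the state; every step pays reward 1.
    chunk_return = sum(GAMMA ** k for k in range(len(chunk)))
    return state + sum(chunk), chunk_return


target = chunked_value_target(
    0.0, toy_policy, toy_model, value=lambda s: s,
    gamma=GAMMA, chunk_len=2, num_chunks=2,
)
print(target)
```

In this toy setup the target spans four imagined steps using two model calls. The reward part equals 1 + 0.5 + 0.25 + 0.125 = 1.875, and the bootstrap term adds 0.5^4 times the value of the final state (4.0), giving 2.125.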

Paper Structure

This paper contains 21 sections, 10 equations, 8 figures, 14 tables, 1 algorithm.

Figures (8)

  • Figure 1: Two main components of MAC. (Left) Action-chunk models predict a future state given a sequence of actions (an "action chunk"), reducing compounding errors and enabling long-horizon model rollouts. (Right) Rejection sampling from an expressive (flow) behavioral action-chunk policy enables modeling multi-modal action distributions, while preventing model exploitation from out-of-distribution actions.
  • Figure 2: Action chunking reduces model errors.
  • Figure 3: Action chunk length vs. performance.
  • Figure 4: OGBench tasks.
  • Figure 5: Training curves of MAC in large-scale, long-horizon environments. We report the success rate over 15 evaluation episodes for each of 4 seeds (60 episodes in total). The shaded region spans [mean - std, mean + std].
  • ...and 3 more figures
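
The rejection-sampling idea in the Figure 1 caption can be sketched as best-of-N selection: draw several candidate action chunks from the behavioral policy, score each one, and keep the best. This is one common reading of "rejection sampling" in this setting, not a confirmed description of the paper's procedure. The names `best_of_n_chunk`, `toy_behavior_policy`, and `toy_score` are hypothetical, and the scoring function stands in for whatever learned value estimate is used in practice.

```python
import random
from typing import Callable, List

Chunk = List[float]


def best_of_n_chunk(
    state: float,
    sample_chunk: Callable[[float, random.Random], Chunk],
    score: Callable[[float, Chunk], float],
    num_samples: int,
    rng: random.Random,
) -> Chunk:
    """Draw num_samples chunks from the behavior policy and keep the best-scoring one."""
    candidates = [sample_chunk(state, rng) for _ in range(num_samples)]
    return max(candidates, key=lambda chunk: score(state, chunk))


def toy_behavior_policy(state: float, rng: random.Random, chunk_len: int = 3) -> Chunk:
    # Hypothetical bimodal behavior: the whole chunk moves left (-1) or right (+1),
    # with small per-step noise. A unimodal policy would average the two modes.
    direction = rng.choice([-1.0, 1.0])
    return [direction + rng.uniform(-0.1, 0.1) for _ in range(chunk_len)]


def toy_score(state: float, chunk: Chunk, goal: float = 3.0) -> float:
    # Hypothetical stand-in for a learned value: closer to the goal is better.
    return -abs(state + sum(chunk) - goal)


rng = random.Random(0)
chosen = best_of_n_chunk(0.0, toy_behavior_policy, toy_score, num_samples=16, rng=rng)
print(chosen)
```

Because the selected chunk is always one of the sampled candidates, it stays within the support of the behavioral policy. That is how this scheme avoids proposing out-of-distribution actions that a learned model might score too optimistically.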