Scalable Offline Model-Based RL with Action Chunks
Kwanyoung Park, Seohong Park, Youngwoon Lee, Sergey Levine
TL;DR
The paper tackles the challenge of scaling offline model-based reinforcement learning (MB-RL) to long-horizon tasks by introducing Model-Based RL with Action Chunks (MAC). MAC combines a multi-step action-chunk dynamics model with a flow-based action-chunk policy and rejection sampling to generate long, in-distribution imaginary rollouts while mitigating compounding model errors. In experiments on datasets of up to 100M transitions, MAC achieves state-of-the-art performance among offline MB-RL methods on challenging long-horizon manipulation tasks, and ablations confirm the importance of action-chunk length, flow rejection sampling, and distillation. Limitations remain in contact-rich locomotion domains, which suggests future work on more expressive dynamics models, but the method provides a scalable, reproducible recipe for long-horizon offline RL.
Abstract
In this paper, we study whether model-based reinforcement learning (RL), in particular model-based value expansion, can provide a scalable recipe for tackling complex, long-horizon tasks in offline RL. Model-based value expansion fits an on-policy value function using length-n imaginary rollouts generated by the current policy and a learned dynamics model. While larger n reduces bias in value bootstrapping, it amplifies accumulated model errors over long horizons, degrading future predictions. We address this trade-off with an action-chunk model that predicts a future state from a sequence of actions (an "action chunk") instead of a single action, which reduces compounding errors. In addition, instead of directly training a policy to maximize rewards, we employ rejection sampling from an expressive behavioral action-chunk policy, which prevents model exploitation from out-of-distribution actions. We call this recipe Model-Based RL with Action Chunks (MAC). Through experiments on highly challenging tasks with large-scale datasets of up to 100M transitions, we show that MAC achieves the best performance among offline model-based RL algorithms, especially on challenging long-horizon tasks.
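To make the two ingredients of the abstract concrete, here is a minimal sketch in Python with NumPy. It is a toy illustration under stated assumptions, not the authors' implementation. The chunk length `H`, the 1-D dynamics in `toy_chunk_model`, the Gaussian `toy_behavior_policy` (standing in for the flow-based behavioral policy), the reward, and the value function are all hypothetical. Rejection sampling is shown as best-of-N selection, which is one common form; the exact selection criterion used by MAC is not specified in this section.

```python
import numpy as np

H = 4  # hypothetical action-chunk length


def toy_chunk_model(state, chunk):
    """Hypothetical chunk dynamics model: one call covers H environment steps.

    Returns the state reached after the whole chunk and the per-step rewards.
    """
    positions = state + np.cumsum(chunk)  # 1-D point that moves by each action
    rewards = -np.abs(positions)          # toy reward: stay close to the origin
    return positions[-1], rewards


def toy_behavior_policy(state, num_candidates, rng):
    """Stand-in for a behavioral action-chunk policy: samples N chunks of length H."""
    return rng.normal(0.0, 0.5, size=(num_candidates, H))


def value_expansion_target(rewards, gamma, bootstrap_value):
    """n-step target: discounted imagined rewards plus a bootstrapped tail value."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(discounts @ rewards + gamma ** n * bootstrap_value)


def best_of_n_chunk(state, num_candidates, gamma, value_fn, rng):
    """Sample chunks from the behavioral policy and keep the highest-scoring one.

    Candidates come only from the behavioral policy, so the selected chunk
    stays within the support of the data distribution.
    """
    candidates = toy_behavior_policy(state, num_candidates, rng)
    scores = np.empty(num_candidates)
    for i, chunk in enumerate(candidates):
        next_state, rewards = toy_chunk_model(state, chunk)
        scores[i] = value_expansion_target(rewards, gamma, value_fn(next_state))
    best = int(np.argmax(scores))
    return candidates[best], scores


rng = np.random.default_rng(0)
toy_value_fn = lambda s: -abs(s)  # hypothetical value estimate for the toy task
chunk, scores = best_of_n_chunk(2.0, 16, 0.99, toy_value_fn, rng)
print("selected chunk:", chunk)
print("best score:", scores.max())
```

In this sketch a rollout of n steps needs only n / H model calls, which is the intuition behind the claim that action chunks reduce compounding errors: fewer predictions are fed back into the model as inputs.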
