DEAS: DEtached value learning with Action Sequence for Scalable Offline RL

Changyeon Kim, Haeone Lee, Younggyo Seo, Kimin Lee, Yuke Zhu

TL;DR

The paper tackles offline RL in long-horizon tasks by introducing DEAS, which leverages temporally extended action sequences as primitive options to reduce the planning horizon. It combines detached value learning (training the critic separately from the actor) with distributional RL and dual discount factors to stabilize learning and prevent value overestimation when using action sequences. Empirically, DEAS achieves state-of-the-art results on challenging OGBench tasks and improves performance of large Vision-Language-Action models on RoboCasa Kitchen and real-world manipulation tasks, demonstrating practical scalability. The approach is compatible with diverse policy-extraction methods and offers a reproducible implementation, highlighting its potential for real-world offline RL in robotics and beyond.

Abstract

Offline reinforcement learning (RL) presents an attractive paradigm for training intelligent agents without expensive online interactions. However, current approaches still struggle with complex, long-horizon sequential decision making. In this work, we introduce DEtached value learning with Action Sequence (DEAS), a simple yet effective offline RL framework that leverages action sequences for value learning. These temporally extended actions provide richer information than single-step actions and can be interpreted through the options framework via semi-Markov decision process Q-learning, enabling reduction of the effective planning horizon by considering longer sequences at once. However, directly adopting such sequences in actor-critic algorithms introduces excessive value overestimation, which we address through detached value learning that steers value estimates toward in-distribution actions that achieve high return in the offline dataset. We demonstrate that DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench and can be applied to enhance the performance of large-scale Vision-Language-Action models that predict action sequences, significantly boosting performance in both RoboCasa Kitchen simulation tasks and real-world manipulation tasks.
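To make the semi-Markov decision process (SMDP) Q-learning view of action sequences concrete, the following is a minimal sketch of the standard SMDP bootstrapped target for a sequence of H actions: the discounted sum of the H rewards collected while executing the sequence, plus the next value discounted by gamma to the power H. The function name `smdp_target` and its parameters are hypothetical and used only for illustration. The sketch uses a single discount factor and a scalar value estimate, so it does not reproduce the paper's distributional objective, dual discount factors, or detached value learning.

```python
import numpy as np


def smdp_target(rewards, next_value, gamma, done=False):
    """Bootstrapped target for one action sequence (hypothetical helper).

    rewards:    the H per-step rewards observed while executing the sequence
    next_value: value estimate at the state reached after the sequence
    gamma:      per-step discount factor (single discount; an assumption here)
    done:       True if the episode terminated within the sequence
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    horizon = rewards.shape[0]

    # Discounted return accumulated inside the sequence: sum_i gamma^i * r_i.
    discounts = gamma ** np.arange(horizon)
    sequence_return = float(np.dot(discounts, rewards))

    # Bootstrap once per sequence, discounted by gamma^H instead of gamma.
    bootstrap = 0.0 if done else (gamma ** horizon) * float(next_value)
    return sequence_return + bootstrap


if __name__ == "__main__":
    # A 3-step sequence with a sparse reward on the last step.
    print(smdp_target([0.0, 0.0, 1.0], next_value=10.0, gamma=0.5))
```

Because the critic bootstraps once per sequence rather than once per step, a task of T environment steps needs roughly T / H backups to propagate reward information, which is the sense in which the effective planning horizon shrinks.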

Paper Structure

This paper contains 62 sections, 8 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview. DEAS is an offline RL framework that learns from action sequences instead of single actions. Unlike previous methods that couple actor-critic training, our key insight is to train the critic separately from the policy (detached value learning) using action sequences, which enables stable learning while avoiding value overestimation. We further enhance stability by combining distributional RL objectives and using dual discount factors, which leads to additional improvement.
  • Figure 2: Simulation task examples. We study DEAS on 30 different tasks from OGBench (Park et al., 2025) and 4 challenging manipulation tasks from RoboCasa Kitchen (Nasiriany et al., 2024).
  • Figure 3: Agent performance across varying dataset sizes on three representative OGBench (Park et al., 2025) tasks, evaluated by success rate (%). Solid lines indicate the mean, while shaded areas denote the stratified bootstrap confidence intervals over 4 independent runs.
  • Figure 4: Real-world tasks. We conduct pick-and-place tasks from the countertop to the bottom cabinet with $\texttt{peach}$, $\texttt{milka}$, and $\texttt{hichew}$.
  • Figure 5: Real-robot platform.
  • ...and 1 more figure