Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Reinforcement Learning

Younggyo Seo, Pieter Abbeel

TL;DR

This work asks whether predicting and optimizing over action sequences can improve reinforcement learning for robotics. It introduces Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a critic-only algorithm that outputs Q-values over whole action sequences, which improves data efficiency on sparse-reward tasks. In experiments on BiGym, RLBench, and HumanoidBench, CQN-AS outperforms strong baselines, and ablations show the importance of the RL objective, the action sequence length, and the temporal ensemble. The findings suggest that action-sequence-based value learning is a practical route to more data-efficient RL in complex robotic domains, with potential extensions to offline, model-based, or vision-enhanced settings.

Abstract

Predicting a sequence of actions has been crucial in the success of recent behavior cloning algorithms in robotics. Can similar ideas improve reinforcement learning (RL)? We answer affirmatively by observing that incorporating action sequences when predicting ground-truth return-to-go leads to lower validation loss. Motivated by this, we introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based RL algorithm that learns a critic network that outputs Q-values over a sequence of actions, i.e., explicitly training the value function to learn the consequence of executing action sequences. Our experiments show that CQN-AS outperforms several baselines on a variety of sparse-reward humanoid control and tabletop manipulation tasks from BiGym and RLBench.

Paper Structure

This paper contains 59 sections, 8 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Summary of results. Coarse-to-fine Q-Network with Action Sequence (CQN-AS) learns a critic network with action sequence. CQN-AS outperforms various RL and BC baselines such as CQN (seo2024continuous), DrQ-v2+ (yarats2022mastering), and ACT (zhao2023learning) on 45 robotic tasks from BiGym (chernyadev2024bigym) and RLBench (james2020rlbench).
  • Figure 2: Analyses. (a) We measure the improvement in the validation L1 loss of the return-to-go prediction model with different action sequence lengths. We find that using an action sequence of length 50 results in lower loss than using a single-step action. (b) We find that SAC and TD3 with action sequences suffer from severe value overestimation in the stand task from HumanoidBench, which leads to near-zero performance, on par with a random policy. (c) Actor-critic algorithms like TD3 become vulnerable to value overestimation when redundant no-op actions are added to the action space. In contrast, CQN, a critic-only algorithm that uses discrete actions, remains robust with high-dimensional action spaces.
  • Figure 3: Coarse-to-Fine Q-Network with Action Sequence. CQN-AS extends Coarse-to-Fine Q-Network (CQN; seo2024continuous), a critic-only RL algorithm for continuous control using discretized actions. (a) CQN progressively zooms into the action space by discretizing it into $B$ bins and finding the bin with the highest Q-value, which is then further discretized at the next level. The action sequence from the last level is used for controlling robots. CQN-AS generalizes this to action sequences by computing all $K$ actions in parallel. (b) We train a critic to predict Q-values over entire action sequences by extracting per-step features and aggregating them with a recurrent network before projecting them to Q-values.
  • Figure 4: Examples of robotic tasks. We study CQN-AS on 25 humanoid control tasks from BiGym (chernyadev2024bigym) and 20 tabletop manipulation tasks from RLBench (james2020rlbench).
  • Figure 5: BiGym results on 25 sparsely-rewarded mobile bi-manual manipulation tasks. All RL algorithms are trained from scratch, with a replay buffer initialized with 17 to 60 human-collected demonstrations, and with an auxiliary BC objective. We report the success rate over 25 episodes. The solid line and shaded regions represent the mean and confidence intervals, respectively, across 8 runs.
  • ...and 4 more figures
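
The coarse-to-fine selection described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `q_fn` interface, and the toy Q-function (which scores each bin by how close its center is to a known target sequence) are all assumptions made for illustration. In the paper the Q-values come from a learned critic. The sketch only shows the zooming logic: at each level, every action dimension of every one of the $K$ steps is split into $B$ bins, the bin with the highest Q-value is kept, and that bin becomes the interval for the next level.

```python
import numpy as np


def coarse_to_fine_select(q_fn, seq_len, act_dim, levels=3, bins=5,
                          low=-1.0, high=1.0):
    """Greedy coarse-to-fine selection of an action sequence.

    q_fn is a hypothetical stand-in for a learned critic. It takes
    (level, lo, hi), where lo and hi have shape (seq_len, act_dim), and
    returns Q-values of shape (seq_len, act_dim, bins), one per bin.
    """
    lo = np.full((seq_len, act_dim), low, dtype=np.float64)
    hi = np.full((seq_len, act_dim), high, dtype=np.float64)
    for level in range(levels):
        q = q_fn(level, lo, hi)        # (K, D, B), all K steps at once
        best = q.argmax(axis=-1)       # best bin per step and dimension
        width = (hi - lo) / bins
        lo = lo + best * width         # zoom into the chosen bin
        hi = lo + width
    return (lo + hi) / 2.0             # center of the final interval


def make_toy_q_fn(target, bins):
    """Toy Q-function (assumption): higher Q for bins nearer the target."""
    def q_fn(level, lo, hi):
        width = (hi - lo) / bins
        offsets = np.arange(bins) + 0.5
        centers = lo[..., None] + offsets * width[..., None]
        return -np.abs(centers - target[..., None])
    return q_fn


# Usage: recover a random target sequence with K=4 steps and 2 dimensions.
rng = np.random.default_rng(0)
target = rng.uniform(-1.0, 1.0, size=(4, 2))
actions = coarse_to_fine_select(make_toy_q_fn(target, bins=5),
                                seq_len=4, act_dim=2, levels=3, bins=5)
```

With 3 levels and 5 bins, the final interval per dimension has width $2 / 5^3 = 0.016$, so the selected action differs from the toy target by at most half of that. More levels or bins give finer resolution without enumerating every fine-grained action up front.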