Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward

Jiarui Yang, Bin Zhu, Jingjing Chen, Yu-Gang Jiang

TL;DR

AC3 (Actor-Critic for Continuous Chunks) is an RL framework that directly learns high-dimensional, continuous action chunks. Using only a few demonstrations and a simple model architecture, it achieves superior success rates on most of the 25 BiGym and RLBench tasks evaluated.

Abstract

Existing reinforcement learning (RL) methods struggle with long-horizon robotic manipulation tasks, particularly those involving sparse rewards. While action chunking is a promising paradigm for robotic manipulation, using RL to directly learn continuous action chunks in a stable and data-efficient manner remains a critical challenge. This paper introduces AC3 (Actor-Critic for Continuous Chunks), a novel RL framework that learns to generate high-dimensional, continuous action sequences. To make this learning process stable and data-efficient, AC3 incorporates targeted stabilization mechanisms for both the actor and the critic. First, to ensure reliable policy improvement, the actor is trained with an asymmetric update rule, learning exclusively from successful trajectories. Second, to enable effective value learning despite sparse rewards, the critic's update is stabilized using intra-chunk $n$-step returns and further enriched by a self-supervised module providing intrinsic rewards at anchor points aligned with each action chunk. We conducted extensive experiments on 25 tasks from the BiGym and RLBench benchmarks. Results show that by using only a few demonstrations and a simple model architecture, AC3 achieves superior success rates on most tasks, validating its effective design.
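The abstract mentions intra-chunk $n$-step returns without giving a formula. The sketch below shows one common way such targets could be computed. It assumes that every step inside a chunk accumulates discounted rewards up to the end of the chunk and then bootstraps from a critic estimate at the chunk boundary. The function name, the `gamma` default, and the boundary-bootstrapping choice are illustrative assumptions, not the paper's definition.

```python
from typing import List, Sequence


def intra_chunk_n_step_targets(
    rewards: Sequence[float],
    bootstrap_value: float,
    gamma: float = 0.99,
) -> List[float]:
    """Hypothetical helper: n-step targets for each step inside one chunk.

    rewards: per-step rewards collected while executing the chunk.
    bootstrap_value: critic estimate for the state reached after the
        chunk ends (use 0.0 if the episode terminated inside the chunk).

    For step k in a chunk of length H, the target is
        sum_{i=k}^{H-1} gamma^(i-k) * r_i + gamma^(H-k) * bootstrap_value,
    so steps near the end of the chunk use shorter returns.
    """
    targets = [0.0] * len(rewards)
    running = bootstrap_value
    # Walk backwards so each target reuses the one computed after it.
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        targets[k] = running
    return targets


# Sparse-reward example: only the last step of the chunk is rewarded.
sparse_targets = intra_chunk_n_step_targets([0.0, 0.0, 1.0], 0.0, gamma=0.5)
```

With a sparse reward at the last step of the chunk, the earlier steps still receive a nonzero target in a single backup, which is the usual motivation for multi-step returns under sparse rewards.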

Paper Structure

This paper contains 17 sections, 14 equations, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: (a) Imitation Learning is not robust to unseen states. (b) Hybrid RL with discrete chunks lacks precision. (c) Our approach, AC3, directly learns continuous chunks for more effective control. (d) AC3 achieves superior performance.
  • Figure 2: Overall framework of AC3. First, a Goal Network is pre-trained using expert data via self-supervised learning to provide intrinsic rewards $r_{\text{int}}$ during subsequent online interactions. Next, during online interaction, the Actor outputs a continuous action chunk and stores new experiences in the Replay Buffer after execution. For training, the Critic is updated via an intra-chunk $n$-step TD loss, while the Actor learns only from the successful trajectories buffer $\mathcal{B}_{\text{succ}}$ to promote stable policy improvement.
  • Figure 3: Performance on 15 bi-manual mobile manipulation tasks in BiGym. All tasks use 10 expert demonstrations as offline data, and all RL algorithms use an auxiliary BC loss for exploration guidance. Solid lines and shaded regions show the mean and standard deviation, respectively.
  • Figure 4: Performance on 10 tabletop manipulation tasks in RLBench. All tasks use 100 synthetic demonstrations as offline data, and all RL algorithms use an auxiliary BC loss for exploration guidance. Solid lines and shaded regions show the mean and standard deviation, respectively.
  • Figure 5: Effect of action chunk length.
  • ...and 8 more figures
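The Figure 2 caption describes two data sources: a Replay Buffer that receives all new experience for critic training, and a successful-trajectories buffer $\mathcal{B}_{\text{succ}}$ that alone feeds the actor. A minimal sketch of that routing follows. The class and field names (`DualBuffer`, `Transition`, `add_episode`) and the episode-level success flag are hypothetical and chosen for illustration; the paper's actual data structures are not given here.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Transition:
    """Hypothetical chunk-level transition record."""
    obs: List[float]
    action_chunk: List[List[float]]  # H actions, each a continuous vector
    reward: float
    next_obs: List[float]
    done: bool


class DualBuffer:
    """Keeps all experience for the critic and only successes for the actor."""

    def __init__(self, capacity: int = 100_000) -> None:
        self.replay: Deque[Transition] = deque(maxlen=capacity)
        self.succ: Deque[Transition] = deque(maxlen=capacity)

    def add_episode(self, episode: List[Transition], success: bool) -> None:
        # Every transition is available to the critic.
        self.replay.extend(episode)
        # Assumption: success is judged per episode, e.g. from the sparse reward.
        if success:
            self.succ.extend(episode)

    def sample_critic(self, batch_size: int) -> List[Transition]:
        return random.sample(list(self.replay), min(batch_size, len(self.replay)))

    def sample_actor(self, batch_size: int) -> List[Transition]:
        return random.sample(list(self.succ), min(batch_size, len(self.succ)))


def _dummy(reward: float, done: bool) -> Transition:
    return Transition([0.0], [[0.0], [0.0]], reward, [0.0], done)


buf = DualBuffer()
buf.add_episode([_dummy(0.0, False), _dummy(0.0, True)], success=False)
buf.add_episode([_dummy(0.0, False), _dummy(0.0, False), _dummy(1.0, True)], success=True)
```

In this sketch the failed episode still contributes to value learning, while the actor batch can only contain transitions from the successful episode.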