Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation

Kun Wu; Yichen Zhu; Jinming Li; Junjie Wen; Ning Liu; Zhiyuan Xu; Jian Tang

TL;DR

Problem: learning multi-task visuomotor policies is challenging because action distributions are multimodal and entangled across tasks. Approach: Discrete Policy uses a VQ-VAE to discretize action sequences into a latent codebook, and a conditional latent diffusion model to generate task-specific latent embeddings, which are decoded into actions given language and observations. Contributions: a two-stage training pipeline over a discrete latent action space, evidence that this latent space disentangles skills, and strong cross-task performance on real-world and simulation benchmarks. Impact: demonstrates scalable, language-conditioned manipulation that generalizes across many tasks.

Abstract

Learning visuomotor policy for multi-task robotic manipulation has been a long-standing challenge for the robotics community. The difficulty lies in the diversity of action space: typically, a goal can be accomplished in multiple ways, resulting in a multimodal action distribution for a single task. The complexity of action distribution escalates as the number of tasks increases. In this work, we propose Discrete Policy, a robot learning method for training universal agents capable of multi-task manipulation skills. Discrete Policy employs vector quantization to map action sequences into a discrete latent space, facilitating the learning of task-specific codes. These codes are then reconstructed into the action space conditioned on observations and language instruction. We evaluate our method on both simulation and multiple real-world embodiments, including both single-arm and bimanual robot settings. We demonstrate that our proposed Discrete Policy outperforms a well-established Diffusion Policy baseline and many state-of-the-art approaches, including ACT, Octo, and OpenVLA. For example, in a real-world multi-task training setting with five tasks, Discrete Policy achieves an average success rate that is 26% higher than Diffusion Policy and 15% higher than OpenVLA. As the number of tasks increases to 12, the performance gap between Discrete Policy and Diffusion Policy widens to 32.5%, further showcasing the advantages of our approach. Our work empirically demonstrates that learning multi-task policies within the latent space is a vital step toward achieving general-purpose agents.
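The vector quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `quantize`, the toy three-entry codebook, and the 2-D latent are hypothetical, and the sketch assumes the usual VQ-VAE rule of snapping an encoder output to its nearest codebook entry under Euclidean distance.

```python
import math
from typing import List, Sequence, Tuple


def quantize(latent: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return (index, code) of the codebook entry nearest to `latent`.

    Hypothetical helper: nearest-neighbour lookup under Euclidean distance,
    the standard quantization rule in a VQ-VAE.
    """
    best_index = min(range(len(codebook)),
                     key=lambda i: math.dist(latent, codebook[i]))
    return best_index, list(codebook[best_index])


# Toy codebook with three 2-D codes (sizes chosen only for illustration).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Stand-in for an encoder output computed from an action sequence.
latent = [0.9, 1.2]

index, code = quantize(latent, codebook)
print(index, code)
```

In a setup like the one described, the discrete index would act as the task-specific code, and a decoder would reconstruct actions from the selected embedding together with observations and the language instruction.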

Paper Structure (15 sections, 4 equations, 9 figures, 5 tables)

Introduction
Related Work
Preliminaries
Methodology
Overview
Vector Quantized Autoencoder for Multimodal Action
Latent Diffusion Model
Network Architecture
Experiments
Experiment Setup
Experimental Results
Visualization on Action Space
Ablation Study
Skill Composition
Conclusion

Figures (9)

Figure 1: Visualization of Discrete Policy. The t-SNE visualization of feature embeddings from Discrete Policy reveals that skills across different tasks cluster closely together. This pattern suggests that discrete latent spaces are capable of disentangling the complex, multimodal action distributions encountered in multi-task policy learning.
Figure 2: Overview of Discrete Policy. In the first training stage, as indicated by the green arrow, we train a VQ-VAE that maps actions into discrete latent space with an encoder and then reconstructs the actions based on the latent embeddings using a decoder. In the second training stage, as indicated by the brown arrow, we train a latent diffusion model that predicts task-specific latent embeddings to guide the decoder in predicting accurate actions.
Figure 3: Real-World Experiment Setup for single-arm Franka robot. We use two external fixed-view Zed cameras. The figure in the upper right corner shows all the objects used in our experiments.
Figure 4: Real-World Experiment Setup for bimanual UR5 robot. We use four fixed-view RealSense D435i cameras.
Figure 5: Demonstrations of the 12 tasks in single-arm robot experiments.
...and 4 more figures
