RoLD: Robot Latent Diffusion for Multi-task Policy Modeling

Wenhui Tan, Bei Liu, Junbo Zhang, Ruihua Song, Jianlong Fu

TL;DR

This study proposes a novel approach that decouples robot action trajectory encoding and control policy generation by leveraging latent action trajectory spaces, enhancing the generalization ability of policy generation on multi-task manipulation tasks.

Abstract

Modeling generalized robot control policies remains challenging for language-guided robot manipulation. Existing methods often struggle to use cross-dataset resources efficiently or rely on resource-intensive vision-language models, which limits their multi-task performance and practical applicability. In this study, we propose a novel approach that decouples robot action trajectory encoding from control policy generation by leveraging a latent action trajectory space, improving the generalization of policy generation on multi-task manipulation. First, we pre-train a task-agnostic auto-encoder on large-scale datasets collected with multiple embodiments in various environments; it projects an action trajectory of several frames, together with observations, into a latent action trajectory space. We then learn a diffusion model in this latent space to generate the actions for the next steps. Experiments on two widely used benchmarks show that our method outperforms baselines by 7%-29% in average success rate across eight tasks. Our method consistently benefits from pre-training, whereas the baselines do not, and it is more than twice as fast as our baseline.
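The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the sizes, the random linear encoder/decoder, and the `oracle_noise` function are all hypothetical stand-ins, and the sampler is a standard DDPM-style reverse loop assumed for illustration only. In the paper the denoiser is a learned network conditioned on observations and instructions; here an oracle that already knows the target latent takes its place so that the data flow (noise, then latent, then action trajectory) can be followed end to end.

```python
import math
import random

# All sizes and names are hypothetical, chosen only for illustration.
HORIZON, ACTION_DIM, LATENT_DIM, NUM_STEPS = 4, 2, 3, 10


def make_linear(rows, cols, rng):
    """Random weights standing in for a trained network (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def encode(traj, w_enc):
    """Stage 1, encoder: flatten an action trajectory and project it to a latent."""
    flat = [a for step in traj for a in step]
    return matvec(w_enc, flat)


def decode(z, w_dec):
    """Stage 1, decoder: map a latent back to HORIZON steps of ACTION_DIM actions."""
    flat = matvec(w_dec, z)
    return [flat[i * ACTION_DIM:(i + 1) * ACTION_DIM] for i in range(HORIZON)]


def noise_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative products of (1 - beta)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars


def oracle_noise(z_t, alpha_bar, z_target):
    """Toy replacement for the learned conditional denoiser (assumption):
    returns the exact noise implied by z_t if the clean latent were z_target."""
    scale = math.sqrt(1.0 - alpha_bar)
    return [(zt - math.sqrt(alpha_bar) * z0) / scale
            for zt, z0 in zip(z_t, z_target)]


def sample_latent(z_target, rng):
    """Stage 2: start from Gaussian noise and denoise step by step in latent space."""
    betas, alpha_bars = noise_schedule(NUM_STEPS)
    z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for t in reversed(range(NUM_STEPS)):
        eps = oracle_noise(z, alpha_bars[t], z_target)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        z = [(zi - coef * ei) / math.sqrt(1.0 - betas[t])
             for zi, ei in zip(z, eps)]
        if t > 0:  # no noise is added on the final step
            sigma = math.sqrt(betas[t])
            z = [zi + sigma * rng.gauss(0.0, 1.0) for zi in z]
    return z


rng = random.Random(0)
w_enc = make_linear(LATENT_DIM, HORIZON * ACTION_DIM, rng)
w_dec = make_linear(HORIZON * ACTION_DIM, LATENT_DIM, rng)

demo_traj = [[0.1 * i, -0.1 * i] for i in range(HORIZON)]  # made-up demonstration
z_target = encode(demo_traj, w_enc)       # latent of the demonstration
z_sampled = sample_latent(z_target, rng)  # latent recovered from pure noise
actions = decode(z_sampled, w_dec)        # decoded action trajectory
```

Because the oracle supplies the exact noise, the final denoising step lands on `z_target` up to floating-point error. A trained denoiser would only approximate this, and the quality of the decoded actions would then depend on how well the auto-encoder was pre-trained.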

Paper Structure (16 sections, 6 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Flowchart of our proposed method.
  • Figure 2: RoLD comprises two core components: 1) an Action Trajectory Auto-encoder (ATA) for unified action trajectory modeling in a condensed latent space and 2) a Latent-diffusion-based Policy Generator (LPG) that iteratively denoises sampled noise to recover the target trajectory latent $z$, conditioned on observations and instructions for efficient policy modeling. In this figure, $cls$, $\{a_i\}$ and $\{p_i\}$ denote the learnable token, action embeddings and position embeddings, respectively. $N_E$/$N_D$ represent the number of layers of the Action Trajectory Encoder/Decoder. The $f_{obs}$ tokens are identical in this figure.
  • Figure 3: RoLD is able to generalize to diverse tasks conditioned on instructions.
  • Figure 4: Relative changes in success rate for RoLD and baseline methods when pre-trained on our processed Open X-Embodiment dataset. The methods are evaluated on Robomimic and Meta-World.
  • Figure 5: t-SNE visualization of ATA's latent space and success rates on Robomimic with different horizon lengths. We use ATA to encode trajectories from the Robomimic dataset into latent variables (the data points in this figure), then manually inspect them and label them with action classes.