Latent Action Diffusion for Cross-Embodiment Manipulation

Erik Bauer, Elvis Nava, Robert K. Katzschmann

TL;DR

This work tackles the embodiment gap in robotic manipulation by learning a semantically aligned latent action space that unifies diverse end-effectors. It combines retargeting, contrastive latent space learning, and diffusion-policy training to obtain a single, embodiment-agnostic policy with embodiment-specific decoders. Experiments across dexterous hands and a parallel jaw gripper show cross-embodiment transfer gains of up to 25.3% in manipulation success rate, supported by ablations. The approach reduces the data collection needed for each new robot and allows data to be shared across robot morphologies.

Abstract

End-to-end learning is emerging as a powerful paradigm for robotic manipulation, but its effectiveness is limited by data scarcity and the heterogeneity of action spaces across robot embodiments. In particular, diverse action spaces across different end-effectors create barriers for cross-embodiment learning and skill transfer. We address this challenge through diffusion policies learned in a latent action space that unifies diverse end-effector actions. We first show that we can learn a semantically aligned latent action space for anthropomorphic robotic hands, a human hand, and a parallel jaw gripper using encoders trained with a contrastive loss. Second, we show that by using our proposed latent action space for co-training on manipulation data from different end-effectors, we can utilize a single policy for multi-robot control and obtain up to 25.3% improved manipulation success rates, indicating successful skill transfer despite a significant embodiment gap. Our approach using latent cross-embodiment policies presents a new method to unify different action spaces across embodiments, enabling efficient multi-robot control and data sharing across robot setups. This unified representation significantly reduces the need for extensive data collection for each new robot morphology, accelerates generalization across embodiments, and ultimately facilitates more scalable and efficient robotic learning.
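To make the idea of encoders "trained with a contrastive loss" concrete, the following is a minimal sketch of one common contrastive objective, a symmetric InfoNCE loss over paired latents. The abstract does not state which contrastive loss the authors use, so the loss form, the temperature value, the function names, and the toy 2-D latents are all assumptions made for illustration. Row i of each list is taken to be the latent of the same retargeted pose seen through two different end-effectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(latents_a, latents_b, temperature=0.1):
    """Mean InfoNCE loss; latents_a[i] and latents_b[i] form the positive pair.

    All other rows of latents_b act as negatives for latents_a[i].
    The temperature of 0.1 is an illustrative assumption.
    """
    n = len(latents_a)
    total = 0.0
    for i in range(n):
        logits = [
            cosine_similarity(latents_a[i], latents_b[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        # Cross-entropy with the matching index i as the target.
        total += log_sum_exp - logits[i]
    return total / n

def symmetric_info_nce(latents_a, latents_b, temperature=0.1):
    """Average the loss in both directions (a -> b and b -> a)."""
    return 0.5 * (
        info_nce(latents_a, latents_b, temperature)
        + info_nce(latents_b, latents_a, temperature)
    )

# Hypothetical toy latents: two poses encoded by a "hand" and a "gripper" encoder.
hand_latents = [[1.0, 0.0], [0.0, 1.0]]
aligned_gripper_latents = [[0.9, 0.1], [0.1, 0.9]]
misaligned_gripper_latents = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_info_nce(hand_latents, aligned_gripper_latents)
loss_misaligned = symmetric_info_nce(hand_latents, misaligned_gripper_latents)
```

In this toy setup the loss is small when matching poses land close together in the latent space and large when they are swapped, which is the property a shared, semantically aligned action space relies on.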

Paper Structure

This paper contains 21 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The three-stage process for learning the cross-embodiment latent action space. Stage 1: Aligned end-effector (EEF) poses are generated by retargeting human hand poses to different robot end-effectors. Stage 2: Embodiment-specific encoders are trained to project these actions into a shared latent space using a contrastive loss. Stage 3: Decoders are trained to reconstruct the original poses from the latent space, and encoders are fine-tuned to improve reconstruction quality.
  • Figure 2: Qualitative evaluation of the joint latent action space. We encode normalized gripper widths $W \in [0,1]$ (from closed to open) and perform cross-modal reconstruction by decoding them into human hand poses (colored lines on left) and poses for the Faive hand (grey model on right). Existing retargeting-based approaches are single-directional (i.e., from human hands to robot hands), a limitation that our latent action space overcomes: any modality can be encoded and decoded to any other modality under the alignment constraints of the data.
  • Figure 3: Success rates for three different tasks comparing single-embodiment diffusion policies to cross-embodied latent diffusion policies trained on data from both embodiments for each task. Block stacking: 200 demos per embodiment, one external camera. Block pick and place: 200 demos per embodiment, one external camera + wrist camera for mimic hand, replaced by zero-padding for Franka gripper. Plush toy pick and place: 100 demos per embodiment, one external camera.
  • Figure 4: Cross-embodiment policy rollouts for two pick and place tasks in different settings. The robots in each setting (left: mimic hand; middle: mimic hand and Franka gripper; right: Faive hand and Franka gripper) are controlled by a single cross-embodiment diffusion policy, demonstrating multi-robot control.
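The Figure 3 caption notes that the wrist camera available on the mimic hand is "replaced by zero-padding" for the Franka gripper, so that one policy can consume observations from both embodiments. A minimal sketch of that idea follows. The function name, the dictionary keys, and the tiny nested-list "images" are hypothetical and chosen only for illustration; the paper's actual observation format is not given in this section.

```python
def make_observation(external_image, wrist_image=None, wrist_shape=(2, 2)):
    """Build an observation with a fixed structure for every embodiment.

    If the embodiment has no wrist camera, the wrist view is filled with
    zeros of the given (height, width) so the policy input layout stays
    the same. Keys and shapes here are illustrative assumptions.
    """
    if wrist_image is None:
        height, width = wrist_shape
        wrist_image = [[0.0] * width for _ in range(height)]
    return {"external": external_image, "wrist": wrist_image}

# Hypothetical 2x2 grayscale frames standing in for camera images.
external_frame = [[0.2, 0.4], [0.6, 0.8]]
wrist_frame = [[0.1, 0.3], [0.5, 0.7]]

hand_observation = make_observation(external_frame, wrist_frame)
gripper_observation = make_observation(external_frame)  # no wrist camera
```

Both observations expose the same keys and image sizes, so a single network can be trained on the mixed dataset without embodiment-specific input branches.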