Scaling Cross-Embodiment World Models for Dexterous Manipulation

Zihao He, Bo Ai, Tongzhou Mu, Yulin Liu, Weikang Wan, Jiawei Fu, Yilun Du, Henrik I. Christensen, Hao Su

TL;DR

This paper tackles cross-embodiment dexterous manipulation by positing that environment dynamics are embodiment-invariant and can be captured by a unified world model. It adopts a particle-based state and action representation, in which hands and objects are sets of 3D particles and actions are particle displacements, and trains a graph-based dynamics model to predict future states. The model is learned from diverse simulated robot hands and real human hands and is deployed via model-based planning that maps joint actions to the particle space through forward kinematics. The paper reports three key findings: increasing the number of training embodiments improves generalization to unseen morphologies; co-training on simulated and real data outperforms training on either alone; and the learned models can control hands with varied degrees of freedom, including on deformable-object manipulation. Together, these results present world models as a promising interface for cross-embodiment dexterous manipulation.

Abstract

Cross-embodiment learning seeks to build generalist robots that operate across diverse morphologies, but differences in action spaces and kinematics hinder data sharing and policy transfer. This raises a central question: Is there any invariance that allows actions to transfer across embodiments? We conjecture that environment dynamics are embodiment-invariant, and that world models capturing these dynamics can provide a unified interface across embodiments. To learn such a unified world model, the crucial step is to design state and action representations that abstract away embodiment-specific details while preserving control relevance. To this end, we represent different embodiments (e.g., human hands and robot hands) as sets of 3D particles and define actions as particle displacements, creating a shared representation for heterogeneous data and control problems. A graph-based world model is then trained on exploration data from diverse simulated robot hands and real human hands, and integrated with model-based planning for deployment on novel hardware. Experiments on rigid and deformable manipulation tasks reveal three findings: (i) scaling to more training embodiments improves generalization to unseen ones, (ii) co-training on both simulated and real data outperforms training on either alone, and (iii) the learned models enable effective control on robots with varied degrees of freedom. These results establish world models as a promising interface for cross-embodiment dexterous manipulation.
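
The deployment loop described above (sample joint actions, map them to particle displacements through forward kinematics, roll them out with the world model, keep the best) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the planar two-link "finger", the random-shooting planner, the mean-distance cost, and in particular `toy_world_model`, which is a hand-written stand-in and not the learned graph-based model from the paper. All function names are hypothetical.

```python
import math
import random


def forward_kinematics(q, link_lengths=(1.0, 1.0)):
    """Planar 2-link finger (assumed toy embodiment): one particle per link end."""
    x, y, angle = 0.0, 0.0, 0.0
    particles = []
    for theta, length in zip(q, link_lengths):
        angle += theta
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        particles.append((x, y))
    return particles


def displacement_field(q, dq):
    """Map a joint-space action dq to per-particle displacements via FK."""
    before = forward_kinematics(q)
    after = forward_kinematics([a + b for a, b in zip(q, dq)])
    return [(ax - bx, ay - by) for (bx, by), (ax, ay) in zip(before, after)]


def toy_world_model(hand, hand_disp, obj, radius=0.5):
    """Stand-in dynamics (NOT the learned model): an object particle within
    `radius` of a hand particle copies that particle's displacement
    (the last such hand particle wins); otherwise it stays put."""
    next_obj = []
    for ox, oy in obj:
        dx = dy = 0.0
        for (hx, hy), (ux, uy) in zip(hand, hand_disp):
            if math.hypot(ox - hx, oy - hy) < radius:
                dx, dy = ux, uy
        next_obj.append((ox + dx, oy + dy))
    return next_obj


def cost(obj, target):
    """Mean distance between predicted and target object particles."""
    dists = [math.hypot(ox - tx, oy - ty) for (ox, oy), (tx, ty) in zip(obj, target)]
    return sum(dists) / len(dists)


def plan_one_step(q, obj, target, num_samples=256, max_delta=0.3, seed=0):
    """Random-shooting planner with a single-step horizon (assumed planner)."""
    rng = random.Random(seed)
    hand = forward_kinematics(q)
    best_dq = [0.0] * len(q)  # the "do nothing" action is the baseline
    best_cost = cost(toy_world_model(hand, displacement_field(q, best_dq), obj), target)
    for _ in range(num_samples):
        dq = [rng.uniform(-max_delta, max_delta) for _ in q]
        predicted = toy_world_model(hand, displacement_field(q, dq), obj)
        c = cost(predicted, target)
        if c < best_cost:
            best_dq, best_cost = dq, c
    return best_dq, best_cost


q0 = [0.0, 0.0]
obj0 = [(2.0, 0.2)]      # one object particle near the fingertip
target0 = [(2.0, 0.6)]   # desired object particle position
initial_cost = cost(obj0, target0)
best_dq, best_cost = plan_one_step(q0, obj0, target0)
```

The point of the sketch is the interface: the planner searches in the joint space of a specific hand, while the dynamics function only ever sees particles and displacements, so swapping the hand would only require swapping `forward_kinematics`.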

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overall framework. Our key idea is to represent both embodiments and objects as 3D particles, and actions as end-effector particle displacement fields. These state–action abstractions unify data and control across embodiments. (a) We train world models on random interaction data from diverse robot hands in simulation and from human demonstrations in the real world. (b) At deployment, joint action samples are mapped into displacement fields via forward kinematics, rolled out by the world model for prediction, and the optimal trajectory is executed on the target hardware. We show a single-step planning horizon here for simplicity.
  • Figure 2: Scaling trends in cross-embodiment world model learning. For each target hand, models are trained on subsets of the remaining hands of varying sizes. All subset combinations at a given size are enumerated (e.g., $\binom{5}{2}$ for size 2), and the mean performance with 95% confidence intervals is reported. Dashed lines indicate models directly trained on the target embodiment.
  • Figure 3: Cross-embodiment setups in simulation and the real world. We use multiple robotic hands in simulation to collect random interaction data, and two real robot hands mounted on a UFACTORY XArm 7 for system deployment.
  • Figure 4: Qualitative results of cross-embodiment deployment. (a) Ability Hand (6-DoF) and (b) XHand (12-DoF) use the same particle-space dynamics model learned from human demonstrations. In each trial, the hand successfully reshapes the deformable clay toward the target shape using a combination of FingersPinch, PalmPress, and ThumbPinch skills.
  • Figure 5: Evaluating training recipes for bridging simulation and the real world. We compare co-training with different mixtures of simulation and real-world data. Legend values indicate the amount of simulation data relative to a fixed quantity of real human data. The y-axis shows prediction error on held-out human interactions, with error bars denoting 95% confidence intervals.
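
The evaluation protocol in the Figure 2 caption (hold out a target hand, enumerate every training subset of a given size from the remaining hands, report the mean with a 95% confidence interval) can be sketched as follows. This is a minimal illustration under stated assumptions: the hand labels are hypothetical, five remaining hands are assumed to match the caption's $\binom{5}{2}$ example, the interval uses a normal approximation (the captions do not say how the intervals are computed), and `placeholder_error` returns arbitrary numbers that carry no meaning and are not results.

```python
import itertools
import math
import statistics


def enumerate_training_subsets(hands, target, size):
    """All size-`size` subsets of the hands other than the held-out target."""
    remaining = [h for h in hands if h != target]
    return list(itertools.combinations(remaining, size))


def mean_with_ci95(values):
    """Mean and half-width of a normal-approximation 95% interval (assumed method)."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, 0.0  # no spread can be estimated from one value
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, 1.96 * sem


def placeholder_error(subset):
    """Hypothetical stand-in for 'train on subset, evaluate on target'.
    The returned numbers are arbitrary and only exercise the bookkeeping."""
    return 0.1 * sum(int(name.split("_")[1]) for name in subset)


hands = [f"hand_{i}" for i in range(6)]  # hypothetical labels
subsets = enumerate_training_subsets(hands, target="hand_0", size=2)
errors = [placeholder_error(s) for s in subsets]
mean_error, half_width = mean_with_ci95(errors)
```

In a real run, `placeholder_error` would be replaced by training a world model on the subset and measuring its prediction error on the held-out hand; repeating this for each subset size gives one point per size on a scaling curve.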