DexFormer: Cross-Embodied Dexterous Manipulation via History-Conditioned Transformer

Ke Zhang; Lixin Xu; Chengyi Song; Junzhe Xu; Xiaoyi Lin; Zeyu Jiang; Renjing Xu

TL;DR

DexFormer tackles cross-embodiment dexterous manipulation by learning a single morphology-agnostic policy that conditions on observation history. It combines a history-conditioned Transformer, a shared canonical action space, and a large-scale morphology-randomization training pipeline, so the policy infers embodiment dynamics implicitly rather than relying on explicit morphology identifiers. The approach achieves strong zero-shot generalization to unseen canonical hands and their variants, scales with parallelism and history length, and shows positive real-world transfer on a LEAP Hand. The work offers a scalable basis for foundation-model-style cross-embodiment manipulation policies that extend dexterity across diverse hardware without embodiment-specific heads or retargeting.

Abstract

Dexterous manipulation remains one of the most challenging problems in robotics, requiring coherent control of high-DoF hands and arms under complex, contact-rich dynamics. A major barrier is embodiment variability: different dexterous hands exhibit distinct kinematics and dynamics, forcing prior methods to train separate policies or rely on shared action spaces with per-embodiment decoder heads. We present DexFormer, an end-to-end, dynamics-aware cross-embodiment policy built on a modified transformer backbone that conditions on historical observations. By using temporal context to infer morphology and dynamics on the fly, DexFormer adapts to diverse hand configurations and produces embodiment-appropriate control actions. Trained on a variety of procedurally generated dexterous-hand assets, DexFormer acquires a generalizable manipulation prior and exhibits strong zero-shot transfer to the LEAP Hand, Allegro Hand, and Rapid Hand. Our results show that a single policy can generalize across heterogeneous hand embodiments, establishing a scalable foundation for cross-embodiment dexterous manipulation. Project website: https://davidlxu.github.io/DexFormer-web/.

Paper Structure (22 sections, 10 equations, 15 figures, 6 tables)

Introduction
Related Work
Cross-embodiment dexterous manipulation
Dynamics-aware manipulation
Test-time adaptation
Method
Problem Formulation
Shared Action Space
Embodiment Generation
History-Conditioned Transformer
Observation Space
Reward Design
Parallelism for Cross-Embodiment Training
Experiments
Implementation Details
...and 7 more sections

Figures (15)

Figure 1: DexFormer learns a unified history-conditioned policy that transfers dexterous grasping across diverse hand embodiments, enabling zero-shot deployment from large-scale simulation to real-world robots.
Figure 2: Shared action space. (a) The canonical action space (left) defines a morphology-invariant embedding in which joints with the same anatomical function share fixed indices in the action space. MCP (blue/orange) governs flexion and abduction, PIP/DIP (green) provide proximal/distal flexion, while the thumb uses a distinct structure where CMC (orange/purple) supports abduction and opposition, MCP (blue) controls basal flexion, and IP (green) provides distal flexion. (b) Canonical embedding flattened into a 20-dimensional space. Lower-DoF hands such as the LEAP and Allegro hands zero-pad unused canonical dimensions, whereas higher-DoF embodiments like the Rapid Hand fully populate the embedding, enabling shared control across heterogeneous embodiments.
Figure 3: History-conditioned transformer policy architecture. At each timestep, a fixed-length history of observations with horizon $H$ is provided as input and tokenized into a sequence of $H$ tokens. The token sequence is processed by a stack of three transformer layers with positional encoding and causal self-attention. The output embedding of the final token, which attends to all preceding history, is extracted and passed to an MLP action head that parameterizes a stochastic policy. Actions are then sampled from the resulting distribution for execution.
Figure 4: Gradient aggregation during distributed training. While rollouts and forward passes are computed locally on independent GPUs, the all-reduce primitive aggregates gradients during backpropagation, averaging parameter updates to ensure the DexFormer weights remain identical across all devices.
Figure 5: Diversified embodiment generation. We synthesize 100 variants of each canonical hand (LEAP, Allegro, and Rapid) for training and 32 variants of each for testing. The canonical hands themselves are held out during training and evaluated zero-shot.
...and 10 more figures
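
The zero-padding described in the Figure 2 caption can be illustrated with a minimal sketch. The slot assignment below is hypothetical: the caption fixes only the 20-dimensional canonical size and the fact that lower-DoF hands leave some dimensions at zero, so `HYPOTHETICAL_16DOF_SLOTS`, `to_canonical`, and `from_canonical` are illustrative names, not the authors' code.

```python
CANONICAL_DIM = 20  # size of the flattened canonical action space (Figure 2b)

# Hypothetical mapping from a 16-joint hand to canonical slots.
# Here each of four digits skips one slot; the real assignment is not given in the source.
HYPOTHETICAL_16DOF_SLOTS = [0, 1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 17, 18]

# A hand that populates every canonical dimension.
FULL_20DOF_SLOTS = list(range(CANONICAL_DIM))


def to_canonical(joint_values, slots):
    """Scatter per-joint values into a canonical vector, zero-padding unused slots."""
    if len(joint_values) != len(slots):
        raise ValueError("one value is required per mapped joint")
    canonical = [0.0] * CANONICAL_DIM
    for value, slot in zip(joint_values, slots):
        canonical[slot] = value
    return canonical


def from_canonical(canonical, slots):
    """Gather only the joints this hand actually has from a canonical action."""
    return [canonical[slot] for slot in slots]


# Usage: a 16-joint command round-trips through the shared space.
hand_action = [0.1 * (i + 1) for i in range(16)]
canonical_action = to_canonical(hand_action, HYPOTHETICAL_16DOF_SLOTS)
recovered = from_canonical(canonical_action, HYPOTHETICAL_16DOF_SLOTS)
```

Because the policy always emits all 20 dimensions, the same network output can drive any hand for which such an index map exists; the hand simply ignores the slots it does not use.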
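
The input handling described in the Figure 3 caption (a fixed-length history of $H$ observations, one token per timestep, causal self-attention) can be sketched as follows. This covers only the history buffer and the attention mask, not the transformer itself. Filling the buffer with copies of the first observation at episode start is an assumption, since the caption does not say how short histories are handled, and `make_history` and `causal_mask` are illustrative names.

```python
from collections import deque


def make_history(horizon, first_obs):
    """Fixed-length buffer holding the most recent `horizon` observations.

    Assumption: at episode start the buffer is filled with copies of the
    first observation so that the sequence always has `horizon` tokens.
    """
    return deque([first_obs] * horizon, maxlen=horizon)


def causal_mask(horizon):
    """mask[i][j] is True when token i may attend to token j, i.e. j <= i."""
    return [[j <= i for j in range(horizon)] for i in range(horizon)]


H = 4
history = make_history(H, first_obs=[0.0, 0.0])
for t in range(1, 7):
    # Appending to a full deque silently drops the oldest observation.
    history.append([float(t), float(t)])

tokens = list(history)  # one token per timestep, oldest first
mask = causal_mask(H)
```

Under a causal mask the last row is entirely `True`, which is why the final token is the natural place to read out a summary of the whole window.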
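
The gradient aggregation described in the Figure 4 caption can be simulated in a single process. This is a toy sketch, not a distributed implementation: `all_reduce_mean` stands in for the collective operation, and the plain gradient-descent update and learning rate are assumptions chosen only to show that averaged gradients keep the replicas identical.

```python
def all_reduce_mean(per_worker_grads):
    """Simulate an averaging all-reduce: every worker receives the element-wise mean."""
    n_workers = len(per_worker_grads)
    mean = [sum(values) / n_workers for values in zip(*per_worker_grads)]
    return [list(mean) for _ in range(n_workers)]


def sgd_step(params, grad, lr):
    """Plain gradient-descent update (an assumption; the optimizer is not named here)."""
    return [p - lr * g for p, g in zip(params, grad)]


# Three workers start from identical weights but compute different local
# gradients, because each one collects its own rollouts.
workers = [[1.0, -2.0], [1.0, -2.0], [1.0, -2.0]]
local_grads = [[0.5, 1.0], [1.5, -1.0], [1.0, 3.0]]

synced_grads = all_reduce_mean(local_grads)
workers = [sgd_step(w, g, lr=0.1) for w, g in zip(workers, synced_grads)]
```

Since every replica applies the same averaged gradient to the same starting weights, the weights stay identical after the update without any explicit parameter broadcast.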
