Structural Action Transformer for 3D Dexterous Manipulation

Xiaohan Lei, Min Wang, Bohong Weng, Wengang Zhou, Houqiang Li

TL;DR

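This paper proposes the Structural Action Transformer (SAT), a 3D dexterous manipulation policy that replaces the conventional temporal-centric action representation with a structural-centric one, treating each action chunk as a variable-length, unordered sequence of joint-wise trajectories. The authors present this representation as a new path toward scaling policies for high-DoF, heterogeneous manipulators.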

Abstract

Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties. Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective. We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks. Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer. This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.
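
The reframing described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the 24-DoF and 6-DoF hands, and the zero-padding with a validity mask used to batch embodiments of different joint counts are all assumptions made for illustration.

```python
import numpy as np


def structural_tokens(chunk: np.ndarray) -> np.ndarray:
    """Turn a temporal-centric chunk of shape (T, D_a) into D_a joint tokens.

    Each returned row is one joint's trajectory over the T timesteps.
    """
    return chunk.T  # shape (D_a, T)


def pad_batch(token_sets):
    """Batch joint-token sets with different joint counts.

    Padding plus a boolean mask is an assumed batching scheme, not a detail
    taken from the paper.
    """
    max_joints = max(t.shape[0] for t in token_sets)
    horizon = token_sets[0].shape[1]
    batch = np.zeros((len(token_sets), max_joints, horizon))
    mask = np.zeros((len(token_sets), max_joints), dtype=bool)
    for i, tokens in enumerate(token_sets):
        batch[i, : tokens.shape[0]] = tokens
        mask[i, : tokens.shape[0]] = True  # True marks a real joint token
    return batch, mask


T = 16  # chunk length
rng = np.random.default_rng(0)
hand_a = rng.normal(size=(T, 24))  # hypothetical 24-DoF hand
hand_b = rng.normal(size=(T, 6))   # hypothetical 6-DoF hand

batch, mask = pad_batch([structural_tokens(hand_a), structural_tokens(hand_b)])
print(batch.shape, mask.sum(axis=1))
```

In this view the token feature size is always the chunk length $T$, so a change of embodiment only changes the number of tokens, which a Transformer handles as a change in sequence length.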

Paper Structure (20 sections, 2 equations, 8 figures, 6 tables)

This paper contains 20 sections, 2 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Conceptual illustration of action chunk tokenization. (a) The conventional temporal-centric perspective, which structures actions as a sequence of $T$ timesteps (chunk length), with each token having dimension $D_a$ (action dim). (b) Our proposed structural-centric perspective, which reframes the action chunk as a sequence of $D_a$ joints, where each token's feature is its temporal trajectory over $T$. This $(D_a, T)$ view naturally handles heterogeneous embodiments as a variable-length, unordered sequence, which is a key feature of our approach.
  • Figure 2: Our proposed model architecture. The policy takes a history of $T_o$ raw 3D point clouds $\mathcal{P}_t = (\mathbf{P}_{t-T_o+1}, \dots, \mathbf{P}_t)$ and a language instruction $L$ as input. Observation Tokenizer: Each point cloud $\mathbf{P}_k$ in the history is processed via Farthest Point Sampling (FPS) and PointNets to extract local geometric tokens and a global scene context. The tokens from each time step are concatenated to form the final observation token sequence. Language is encoded by a T5 tokenizer [2020t5]. Structural Action Tokenizer: Guided by the manipulator’s morphology, the Embodied Joint Codebook produces structural-centric embeddings aligned with the action dimension $D_a$, which are added to the time-stepped noisy tokens $\mathbf{A}_t^\tau$. Structural Action Transformer: A DiT [peebles2023scalable] with causal masking predicts the action velocity field. This field is then integrated via an ODE solver to produce the final action chunk $\mathbf{A}_t$.
  • Figure 3: Composition of the offline pre-training dataset. The pie chart illustrates the relative data scale of each of the constituent datasets [liu2022hoi4d, grauman2024ego, pan2023aria, fourier2025actionnet, wang2024dexcap, rajeswaran2017learning, bao2023dexart, chen2023bi].
  • Figure 4: Few-shot adaptation efficiency. We plot the average success rate versus training epochs for our method and the UniAct [zheng2025universal] baseline, evaluated in few-shot settings using varying numbers of in-domain demonstrations.
  • Figure 5: Frequency analysis of joint types in our Embodied Joint Codebook, derived from a survey of 10 common dexterous hands.
  • ...and 3 more figures
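
The generation step named in the Figure 2 caption (predict a velocity field, then integrate it with an ODE solver) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it assumes a straight-line flow matching path with noise at $\tau = 0$ and the action chunk at $\tau = 1$, a fixed-step Euler solver, and an oracle velocity function standing in for the trained Transformer. All function names are hypothetical.

```python
import numpy as np


def interpolate(noise, action, tau):
    """Point on the assumed straight path from noise (tau=0) to action (tau=1)."""
    return (1.0 - tau) * noise + tau * action


def target_velocity(noise, action):
    """Regression target for the velocity field along the straight path."""
    return action - noise


def flow_matching_loss(predicted_velocity, noise, action):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity(noise, action)) ** 2))


def euler_integrate(velocity_fn, noise, num_steps=10):
    """Integrate dx/dtau = velocity_fn(x, tau) from tau=0 to tau=1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


rng = np.random.default_rng(0)
D_a, T = 6, 16                        # joints and chunk length (illustrative sizes)
action = rng.normal(size=(D_a, T))    # stand-in for a demonstrated action chunk
noise = rng.normal(size=(D_a, T))     # noisy starting tokens


def oracle_velocity(x, tau):
    """Stand-in for the trained model: returns the exact target velocity."""
    return target_velocity(noise, action)


generated = euler_integrate(oracle_velocity, noise)
print(np.allclose(generated, action))
```

With a learned model, `oracle_velocity` would be replaced by a network conditioned on the observation and language tokens, and the number of solver steps becomes a trade-off between inference speed and integration accuracy.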