Embedding Morphology into Transformers for Cross-Robot Policy Learning

Kei Suzuki, Jing Liu, Ye Wang, Chiori Hori, Matthew Brand, Diego Romeres, Toshiaki Koike-Akino

TL;DR

This work proposes an embodiment-aware transformer policy that injects morphology via three mechanisms: kinematic tokens that factorize actions across joints and compress time through per-joint temporal chunking; a topology-aware attention bias that encodes kinematic topology as an inductive bias in self-attention, encouraging message passing along kinematic edges; and joint-attribute conditioning that augments topology with per-joint descriptors to capture semantics beyond connectivity.

Abstract

Cross-robot policy learning -- training a single policy to perform well across multiple embodiments -- remains a central challenge in robot learning. Transformer-based policies, such as vision-language-action (VLA) models, are typically embodiment-agnostic and must infer kinematic structure purely from observations, which can reduce robustness across embodiments and even limit performance within a single embodiment. We propose an embodiment-aware transformer policy that injects morphology via three mechanisms: (1) kinematic tokens that factorize actions across joints and compress time through per-joint temporal chunking; (2) a topology-aware attention bias that encodes kinematic topology as an inductive bias in self-attention, encouraging message passing along kinematic edges; and (3) joint-attribute conditioning that augments topology with per-joint descriptors to capture semantics beyond connectivity. Across a range of embodiments, this structured integration consistently improves performance over a vanilla $\pi_{0.5}$ VLA baseline, indicating improved robustness both within an embodiment and across embodiments.
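
To make the first mechanism concrete, the following is a minimal sketch of per-joint temporal chunking, not the authors' implementation. It assumes an action chunk stored as a `(T, J)` nested list (time steps by joints) and a hypothetical chunk length `chunk`; the function name `kinematic_tokens`, the joint-major token ordering, and the absence of any learned projection are illustrative assumptions, since this summary does not specify them.

```python
from typing import List


def kinematic_tokens(actions: List[List[float]], chunk: int) -> List[List[float]]:
    """Turn a (T, J) action chunk into per-joint temporal tokens.

    actions[t][j] is the command for joint j at time step t. Each returned
    token holds `chunk` consecutive time steps of a single joint, so the
    sequence has J * (T // chunk) tokens, ordered joint-major.
    (Layout and ordering are assumptions made for illustration.)
    """
    horizon = len(actions)
    if horizon == 0 or chunk <= 0 or horizon % chunk != 0:
        raise ValueError("horizon must be a positive multiple of chunk")
    num_joints = len(actions[0])

    tokens = []
    for j in range(num_joints):                # factorize across joints
        for start in range(0, horizon, chunk):  # compress time per joint
            tokens.append([actions[t][j] for t in range(start, start + chunk)])
    return tokens


# Toy example: 4 time steps, 3 joints, chunks of 2 steps.
toy_actions = [[t * 10 + j for j in range(3)] for t in range(4)]
toy_tokens = kinematic_tokens(toy_actions, chunk=2)
print(len(toy_tokens))  # 3 joints * 2 chunks per joint
```

Under these assumptions, a horizon of `T` steps over `J` joints yields `J * T / chunk` tokens, each tied to exactly one joint, which is what allows an attention bias to be defined over joint pairs.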

Paper Structure (49 sections, 16 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Embedding robot morphology into a Transformer-based VLA policy: We embed kinematic topology and per-joint semantics into the action policy. This design consistently improves cross-robot policy learning.
  • Figure 2: Embodiment-aware Transformer Policy: Morphology is embedded via three mechanisms: (1) kinematic tokens; (2) topology-aware attention bias; and (3) joint-attribute conditioning.
  • Figure 3: Simulation environments for evaluation: All environments evaluate language-conditioned pick-and-place manipulation.
  • Figure 4: Multi-embodiment learning curves on Panda--SO101: We jointly train a single policy on a mixed Panda (DROID) and SO101 dataset. We report Macro SR, defined as $(\mathrm{SR}_{\text{Panda}}+\mathrm{SR}_{\text{SO101}})/2$. Our full model outperforms the $\pi_{0.5}$ baseline throughout training. For completeness, per-embodiment success rates are reported in the appendix. Shaded regions indicate 95% confidence intervals.
  • Figure 5: Attention mask: Attention mask used in our VLA action policy with kinematic tokens. Dark cells indicate unmasked attention and light cells indicate masked attention. Tokens are grouped by type for visualization (image/prompt/action/kinematic; each group may contain multiple tokens). We append kinematic tokens and apply a topology-aware mask only in the joint-to-joint block to encode kinematic connectivity, while all other attention patterns follow the base $\pi_{0.5}$ mask.
  • ...and 3 more figures
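
The masking scheme described in the Figure 5 caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: the base $\pi_{0.5}$ mask is taken as an opaque input, `joint_of_token` is a hypothetical list mapping each token to its joint index (or `None` for image/prompt/action tokens), and the connectivity rule (a joint token may attend to itself and to tokens of directly connected joints) is an assumption about what "kinematic connectivity" means here. The function name `apply_topology_mask` is also hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def apply_topology_mask(
    base_mask: List[List[bool]],
    joint_of_token: Sequence[Optional[int]],
    edges: Sequence[Tuple[int, int]],
) -> List[List[bool]]:
    """Restrict only the joint-to-joint block of an attention mask.

    base_mask[q][k] is True when query token q may attend to key token k.
    Entries where either token is non-kinematic are copied unchanged.
    Inside the joint-to-joint block, attention is kept only if the base
    mask allows it AND the two joints are identical or share a kinematic
    edge (an assumed rule, chosen for illustration).
    """
    # Treat kinematic edges as undirected.
    adjacent = {(a, b) for a, b in edges} | {(b, a) for a, b in edges}

    mask = [row[:] for row in base_mask]  # copy so the base mask is untouched
    for q, joint_q in enumerate(joint_of_token):
        if joint_q is None:
            continue
        for k, joint_k in enumerate(joint_of_token):
            if joint_k is None:
                continue
            connected = joint_q == joint_k or (joint_q, joint_k) in adjacent
            mask[q][k] = base_mask[q][k] and connected
    return mask


# Toy example: 2 non-kinematic tokens followed by a 3-joint serial chain 0-1-2.
toy_base = [[True] * 5 for _ in range(5)]  # placeholder, not the real base mask
toy_mask = apply_topology_mask(toy_base, [None, None, 0, 1, 2], [(0, 1), (1, 2)])
```

In the toy example, the token for joint 0 can still attend to the token for joint 1 but no longer to the token for joint 2, while every entry involving a non-kinematic token keeps its base value.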