MOTIF: Learning Action Motifs for Few-shot Cross-Embodiment Transfer

Heng Zhi, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang, Heng Tao Shen

TL;DR

MOTIF targets efficient few-shot cross-embodiment transfer. It decouples embodiment-agnostic spatiotemporal patterns, termed action motifs, from heterogeneous action data, and uses a lightweight predictor to infer these motifs from real-time inputs and guide a flow-matching policy.

Abstract

While vision-language-action (VLA) models have advanced generalist robotic learning, cross-embodiment transfer remains challenging due to kinematic heterogeneity and the high cost of collecting sufficient real-world demonstrations to support fine-tuning. Existing cross-embodiment policies typically rely on shared-private architectures, which suffer from limited capacity of private parameters and lack explicit adaptation mechanisms. To address these limitations, we introduce MOTIF for efficient few-shot cross-embodiment transfer that decouples embodiment-agnostic spatiotemporal patterns, termed action motifs, from heterogeneous action data. Specifically, MOTIF first learns unified motifs via vector quantization with progress-aware alignment and embodiment adversarial constraints to ensure temporal and cross-embodiment consistency. We then design a lightweight predictor that predicts these motifs from real-time inputs to guide a flow-matching policy, fusing them with robot-specific states to enable action generation on new embodiments. Evaluations across both simulation and real-world environments validate the superiority of MOTIF, which significantly outperforms strong baselines in few-shot transfer scenarios by 6.5% in simulation and 43.7% in real-world settings. Code is available at https://github.com/buduz/MOTIF.
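
As a rough illustration of the vector-quantization step mentioned in the abstract, the sketch below maps continuous latent vectors to their nearest codebook entries. It is a minimal, generic nearest-neighbour quantizer, not the authors' implementation: the names `quantize` and `squared_distance`, the toy codebook, and the 2-D latents are all assumptions made for this example, and the progress-aware alignment and adversarial objectives are omitted.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Replace each latent vector with its nearest codebook entry.

    Returns the chosen codebook indices (the discrete "motif" ids in this
    toy setting) and the corresponding quantized vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        distances = [squared_distance(z, c) for c in codebook]
        k = min(range(len(codebook)), key=distances.__getitem__)
        indices.append(k)
        quantized.append(list(codebook[k]))
    return indices, quantized


# Hypothetical toy data: three codebook entries and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]
indices, quantized = quantize(latents, codebook)
print(indices)  # [0, 1, 2]
```

In a trained VQ-VAE the codebook is learned jointly with the encoder and decoder. Here it is fixed by hand purely to show the lookup.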

Paper Structure

This paper contains 35 sections, 19 equations, 10 figures, 12 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Concept and Performance of MOTIF. (Top) Motif-guided Transfer. MOTIF extracts embodiment-agnostic action motifs by aligning execution segments from heterogeneous robots (e.g., xArm6, Panda), bridging kinematic gaps for cross-embodiment transfer. The schematic illustrates how task behaviors learned by a source embodiment $E1$ on task $T1$ are adapted to a target ($E2+T1$). (Bottom Left) Simulation Results. MOTIF consistently outperforms strong baselines in Transfer Success Rate across all data regimes (1- to 50-shot). (Bottom Right) Real-world Results. Physical evaluations further validate this effectiveness, demonstrating significant improvements in both Transfer and Global success rates against SOTA methods.
  • Figure 2: Overview of the MOTIF framework. (Left) Stage I: We learn unified action motifs from heterogeneous robot data using VQ-VAE augmented with Progress-Aware Alignment and Embodiment Adversarial objectives to ensure cross-embodiment consistency. (Top Right) Stage II: A multimodal predictor infers these motifs from vision and language inputs using frozen foundation encoders. (Bottom Right) Stage III: Inferred motifs serve as structural guidance for a flow-matching policy, enabling a Diffusion Transformer (DiT) to generate embodiment-specific actions via few-shot transfer.
  • Figure 3: Architecture of the Latent Action Motif Learning Module. The encoder integrates progress-aware positional encodings (PE) and employs a local-attention Transformer with a sliding-window mask to capture local dynamics, followed by strided 1D convolution for temporal downsampling.
  • Figure 4: Overview of Simulation and Real-World Environments. We evaluate MOTIF across heterogeneous embodiments in both simulated (Left) and physical (Right) settings. The experiments follow an interleaved task allocation protocol, where red bounding boxes denote the target (few-shot) embodiment-task pairs used to assess cross-embodiment transfer capability. The remaining pairs serve as the source domain with full supervision.
  • Figure 5: Hardware Setup. We evaluate MOTIF on two heterogeneous single-arm robots: Piper (left) and ARX5 (right).
  • ...and 5 more figures
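
The Figure 3 caption mentions a sliding-window attention mask followed by strided 1D convolution for temporal downsampling. The sketch below shows one generic way to build each piece in plain Python. The function names, the symmetric window radius, the averaging kernel, and the stride are assumptions chosen for illustration; the caption does not specify these details.

```python
from typing import List, Sequence


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Boolean attention mask: position i may attend to j iff |i - j| <= window.

    Assumes a symmetric window of radius `window` (an illustrative choice).
    """
    return [
        [abs(i - j) <= window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def strided_conv1d(
    signal: Sequence[float], kernel: Sequence[float], stride: int
) -> List[float]:
    """Single-channel 1D convolution (no padding) applied every `stride` steps."""
    k = len(kernel)
    return [
        sum(signal[start + t] * kernel[t] for t in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


# Each of 5 time steps attends only to itself and its immediate neighbours.
mask = sliding_window_mask(seq_len=5, window=1)
print(mask[0])  # [True, True, False, False, False]

# A stride-2 averaging kernel halves the temporal resolution of a toy sequence.
downsampled = strided_conv1d([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0.5, 0.5], stride=2)
print(downsampled)  # [1.5, 3.5, 5.5]
```

Restricting attention to a local window keeps each token focused on nearby time steps, while the strided convolution shortens the sequence before any later processing.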

Theorems & Definitions (1)

  • Definition 4.1: Action Motifs