Astra: Efficient Transformer Architecture and Contrastive Dynamics Learning for Embodied Instruction Following

Yueen Ma, Dafeng Chi, Shiguang Wu, Yuecheng Liu, Yuzheng Zhuang, Irwin King

TL;DR

Astra introduces a segment-level Transformer for embodied instruction following (EIF). It uses trajectory attention, which keeps attention causal across segments but bidirectional within each segment, and per-dimension learnable action queries to decode actions in parallel. A complementary contrastive dynamics learning (CDL) objective encodes entire trajectories to strengthen environment dynamics modeling and cross-modal alignment; positive samples come from action perturbations and image augmentations, and negatives from mismatched segments. The approach achieves substantial performance gains on VIMA-Bench, ManiSkill, and CALVIN, and ablations confirm the critical roles of trajectory attention, action queries, and CDL. The results suggest that segment-level processing and a lightweight contrastive signal can significantly improve imitation learning in multimodal EIF tasks, with practical implications for efficient robotics Transformers and potential integration with large pretrained vision-language-action (VLA) models. Astra's architecture also remains compatible with various vision, language, and action encoders, which allows flexible deployment and future real-world extensions such as 3D perception integration.
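The sample construction summarized above can be illustrated with a minimal sketch. It is not the paper's objective: the Gaussian form of the action perturbation, the InfoNCE-style loss, the temperature, and all function names (`perturb_actions`, `mismatch_actions`, `info_nce`) are assumptions made for illustration, and the trajectory embeddings are taken as given vectors rather than produced by Astra's encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_actions(actions, scale=0.01):
    """Positive sample: add small Gaussian noise to the actions (assumed perturbation form)."""
    return actions + rng.normal(0.0, scale, size=actions.shape)

def mismatch_actions(actions_batch):
    """Negative samples: shift actions by one along the batch axis, so each
    trajectory's states would be paired with another trajectory's actions."""
    return np.roll(actions_batch, shift=1, axis=0)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss on cosine similarities (assumed loss, not taken from the paper)."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = unit(anchor), unit(positive), unit(negatives)
    # Index 0 holds the positive similarity; the rest are negatives.
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits = logits - logits.max()  # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

# Toy batch: 3 trajectories, 4 steps each, 2 action dimensions.
actions_batch = rng.normal(size=(3, 4, 2))
positive_actions = perturb_actions(actions_batch[0])
negative_actions = mismatch_actions(actions_batch)

# Toy trajectory embeddings standing in for pooled encoder outputs.
anchor = np.array([1.0, 0.0])
loss_good = info_nce(anchor, np.array([1.0, 0.0]), np.array([[0.0, 1.0], [-1.0, 0.0]]))
loss_bad = info_nce(anchor, np.array([0.0, 1.0]), np.array([[1.0, 0.0], [-1.0, 0.0]]))
```

In this toy setup the loss is small when the positive embedding aligns with the anchor and large when a negative aligns with it instead, which is the behavior a contrastive objective of this kind is meant to reward.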

Abstract

Vision-language-action models have gained significant attention for their ability to model multimodal sequences in embodied instruction following tasks. However, most existing models rely on causal attention, which we find suboptimal for processing sequences composed of interleaved segments from different modalities. In this paper, we introduce Astra, a novel Transformer architecture featuring trajectory attention and learnable action queries, designed to efficiently process segmented multimodal trajectories and predict actions for imitation learning. Furthermore, we propose a contrastive dynamics learning objective to enhance the model's understanding of environment dynamics and multimodal alignment, complementing the primary behavior cloning objective. Through extensive experiments on three large-scale robot manipulation benchmarks, Astra demonstrates substantial performance improvements over previous models.

Paper Structure

This paper contains 37 sections, 3 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Comparison of information flow in an action segment. Squares represent tokens, while orange dots represent their embeddings. Three action tokens comprise an action "segment". The lines illustrate information flow from input embeddings (bottom) to output embeddings (top) through a Transformer self-attention layer. In trajectory attention, tokens attend not only to preceding tokens, as in causal attention, but also to subsequent tokens within the same segment, as indicated by the green lines.
  • Figure 2: The architecture of Astra. A trajectory $\tau$ comprises a prompt segment $p_{1:4}$, state segments $s_{1:2,t}$, and action segments $a_{1:3,t}$. Learnable action queries $q_{1:3,t}$ are inserted after state segments to extract information for action generation. Vertical dashed lines separate these segments. Token embeddings (orange dots) can attend to embeddings in all previous segments (thick horizontal arrows) and to all embeddings within the same segment (gray and green lines). Notably, action queries are hidden from other tokens and can only read from preceding tokens. To facilitate contrastive dynamics learning, Astra can also encode the entire trajectory by pooling the embeddings of the last segment (red box).
  • Figure 3: Attention matrices of causal and trajectory attention. The direction of attention is from the top (input) to the left (output). Dark cells represent attention masks. Green-bordered cells highlight additional information flow enabled by trajectory attention, corresponding to the green lines in Figure 1.
  • Figure 4: Contrastive dynamics learning. (a) In the anchor trajectory (blue arrow), the object on the right is picked up and placed into the bin on the left. A slightly deviated trajectory (green arrow) can still reach the desired destination, enabling action perturbation to be used in constructing positive samples. (b) Given the anchor, we construct a positive sample by applying image augmentation (aug.) and the proposed action perturbation. Negative samples are created by mismatching states and actions from other trajectories.
  • Figure 5: Loss and accuracy curves during training on VIMA-Bench.
  • ...and 11 more figures
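
The attention pattern described in the captions of Figures 1 to 3 can be sketched as a boolean mask. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `trajectory_attention_mask`, the segment-id encoding, and the choice to let each action query attend to itself but not to other queries are all hypothetical.

```python
import numpy as np

def trajectory_attention_mask(segment_ids, is_query=None):
    """Return allowed[i, j]: True if output token i may attend to input token j.

    segment_ids: one integer per token, non-decreasing along the sequence.
    is_query:    optional boolean flags marking learnable action queries.
    """
    seg = np.asarray(segment_ids)
    # Causal across segments, bidirectional within a segment:
    # token i sees token j whenever j's segment is not later than i's.
    allowed = seg[None, :] <= seg[:, None]
    if is_query is not None:
        q = np.asarray(is_query, dtype=bool)
        # Action queries are hidden from other tokens (assumption: also from
        # other queries), so their columns are masked everywhere...
        allowed[:, q] = False
        # ...except that each query keeps itself, so every row has a valid entry.
        idx = np.flatnonzero(q)
        allowed[idx, idx] = True
    return allowed

# Layout loosely following Figure 2 for one time step:
# prompt p1..p4 | state s1..s2 | queries q1..q3 | action a1..a3
segment_ids = [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3]
is_query = [False] * 6 + [True] * 3 + [False] * 3
mask = trajectory_attention_mask(segment_ids, is_query)

# Plain causal mask over the same tokens, for comparison.
causal = np.tril(np.ones((len(segment_ids), len(segment_ids)), dtype=bool))
```

Entries that are True in `mask` but False in `causal` correspond to the extra intra-segment flow that the captions mark in green, for example the first prompt token attending to the last prompt token.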