CroSTAta: Cross-State Transition Attention Transformer for Robotic Manipulation

Giovanni Minelli, Giulio Turrisi, Victor Barasuol, Claudio Semini

TL;DR

Imitation learning for robotic manipulation often fails under distribution shifts; this work presents State Transition Attention (STA), a cross-state attention mechanism that modulates attention based on learned state evolution patterns, combined with temporal masking to promote historical reasoning. The STA Transformer uses an encoder for world state and a decoder with STA cross-attention and self-attention, producing joint delta actions without requiring explicit planning. Across four ManiSkill tasks with recovery-rich demonstrations, STA consistently outperforms standard cross-attention and traditional temporal models, with notable gains in precision-critical tasks. The findings demonstrate robust temporal reasoning from history and suggest practical improvements for manipulation policies operating with partial or noisy perceptual input.

Abstract

Learning robotic manipulation policies through supervised learning from demonstrations remains challenging when policies encounter execution variations not explicitly covered during training. While incorporating historical context through attention mechanisms can improve robustness, standard approaches process all past states in a sequence without explicitly modeling the temporal structure that demonstrations may include, such as failure and recovery patterns. We propose a Cross-State Transition Attention Transformer that employs a novel State Transition Attention (STA) mechanism to modulate standard attention weights based on learned state evolution patterns, enabling policies to better adapt their behavior based on execution history. Our approach combines this structured attention with temporal masking during training, where visual information is randomly removed from recent timesteps to encourage temporal reasoning from historical context. Evaluation in simulation shows that STA consistently outperforms standard cross-attention and temporal modeling approaches like TCN and LSTM networks across all tasks, achieving more than 2x improvement over cross-attention on precision-critical tasks.
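
The temporal masking described in the abstract can be pictured with a small sketch. It assumes that "removing" visual information means zeroing a timestep's visual feature vector, and that each of the most recent `window` timesteps is masked independently with probability `mask_prob`. The function name, both parameters, and the zeroing choice are illustrative assumptions, not details given in the abstract. Proprioceptive inputs are left out because the abstract only mentions masking visual information.

```python
import random


def mask_recent_visual(visual_seq, window=3, mask_prob=0.5, rng=None):
    """Return (masked_copy, mask_flags) for a history of visual features.

    visual_seq: list of per-timestep feature vectors, oldest first.
    window, mask_prob: hypothetical hyperparameters (not from the paper).
    Only the last `window` timesteps are candidates for masking.
    """
    rng = rng or random.Random()
    masked = [list(frame) for frame in visual_seq]  # copy; input stays intact
    flags = [False] * len(masked)
    first_recent = max(0, len(masked) - window)
    for t in range(first_recent, len(masked)):
        if rng.random() < mask_prob:
            masked[t] = [0.0] * len(masked[t])  # assumed form of "removal"
            flags[t] = True
    return masked, flags


# Toy history: 6 timesteps of 4-dimensional visual features.
history = [[float(t + 1)] * 4 for t in range(6)]
masked, flags = mask_recent_visual(history, window=3, mask_prob=0.5,
                                   rng=random.Random(0))
```

With masking applied only during training, a policy that still has to predict the right action must rely on the older, unmasked timesteps, which is the behavior the masking is meant to encourage.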

Paper Structure

This paper contains 17 sections, 2 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Performance comparison on simulated manipulation tasks when training with successful-only demonstrations (left) versus recovery-rich demonstrations (right). Our State Transition Attention (STA) mechanism shows particular effectiveness at exploiting temporal patterns in recovery-rich data, achieving superior performance compared to standard temporal modeling approaches.
  • Figure 2: Graphical representation of cross-attention in the adopted Transformer architecture. The query (Q) tokens represent joint values, while the key (K), value (V), and state (S) tokens encode the overall system information. Panel (a), on the left, illustrates standard cross-attention; panel (b), on the right, depicts State Transition Attention.
  • Figure 3: Architecture overview of our proposed Transformer with STA. The encoder processes visual observations through a CNN and proprioceptive data through an MLP to generate state tokens. The decoder employs standard self-attention for input token interactions (white squares) and our novel STA module as cross-attention with current and historical state tokens (colored squares). Both the decoder's input tokens and the encoder's state tokens are cached for reuse in later steps.
  • Figure 4: ManiSkill manipulation tasks used for evaluation. (a) StackCube: Single-arm manipulation requiring coordinated grasping and placement; (b) PegInsertionSide: Precision insertion task demanding correct orientation and alignment of the peg with the box hole slot; (c) TwoRobotStackCube: Bimanual coordination task for collaborative cube stacking in a target location; (d) UnitreeG1TransportBox: Multi-joint coordination task involving arm and torso coordination in a humanoid robot to transport a box across the workspace.
  • Figure 5: Success rate comparison across four ManiSkill (Mu et al., 2021) manipulation tasks. All methods were trained for 50 epochs on recovery-rich demonstrations with periodic validation performed directly in the simulation environment. Results represent the best validation checkpoint performance averaged over 3 seeds with 100 episodes per evaluation. Variance between seeds was negligible and is omitted for clarity.
  • ...and 3 more figures
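
The contrast drawn in the Figure 2 caption, between standard cross-attention and attention modulated by state tokens, can be illustrated with a toy sketch. The standard part is ordinary scaled dot-product attention. The modulation part is only one hypothetical reading of "modulating attention weights based on state evolution": a sigmoid gate computed from the difference between consecutive state tokens, multiplied into the weights and renormalized. The gate form, the vector `w`, and all function names are assumptions made for illustration; the text here does not give the paper's actual STA equations.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_weights(query, keys):
    """Standard scaled dot-product attention weights for a single query."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])


def transition_gates(states, w):
    """Hypothetical gate per timestep: sigmoid(w . (s_t - s_{t-1})).

    The first timestep has no predecessor, so its delta is taken as zero.
    """
    gates = []
    for t, s in enumerate(states):
        prev = states[t - 1] if t > 0 else s
        delta = [a - b for a, b in zip(s, prev)]
        gates.append(1.0 / (1.0 + math.exp(-dot(w, delta))))
    return gates


def modulated_attention(query, keys, values, states, w):
    """Rescale standard weights by the gates, renormalize, and mix values."""
    base = cross_attention_weights(query, keys)
    raw = [b * g for b, g in zip(base, transition_gates(states, w))]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(values[0])
    out = [sum(wt * v[i] for wt, v in zip(weights, values)) for i in range(dim)]
    return weights, out


# Toy example: three past timesteps, two-dimensional tokens.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
states = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]  # state changes only at the last step
w = [1.0, 0.0]  # stands in for learned parameters

base = cross_attention_weights(query, keys)
weights, out = modulated_attention(query, keys, values, states, w)
```

In this toy setup the first and last timesteps receive equal standard weights, but the last one gains weight after modulation because it is the only step where the state changed. Setting `w` to zeros makes every gate equal, so the modulated weights reduce to the standard ones.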