CroSTAta: Cross-State Transition Attention Transformer for Robotic Manipulation
Giovanni Minelli, Giulio Turrisi, Victor Barasuol, Claudio Semini
TL;DR
Imitation learning for robotic manipulation often fails under distribution shifts; this work presents State Transition Attention (STA), a cross-state attention mechanism that modulates attention based on learned state evolution patterns, combined with temporal masking to promote historical reasoning. The STA Transformer uses an encoder for world state and a decoder with STA cross-attention and self-attention, producing joint delta actions without requiring explicit planning. Across four ManiSkill tasks with recovery-rich demonstrations, STA consistently outperforms standard cross-attention and traditional temporal models, with notable gains in precision-critical tasks. The findings demonstrate robust temporal reasoning from history and suggest practical improvements for manipulation policies operating with partial or noisy perceptual input.
Abstract
Learning robotic manipulation policies through supervised learning from demonstrations remains challenging when policies encounter execution variations not explicitly covered during training. While incorporating historical context through attention mechanisms can improve robustness, standard approaches process all past states in a sequence without explicitly modeling the temporal structure that demonstrations may include, such as failure and recovery patterns. We propose a Cross-State Transition Attention Transformer that employs a novel State Transition Attention (STA) mechanism to modulate standard attention weights according to learned state evolution patterns, enabling policies to better adapt their behavior to execution history. Our approach combines this structured attention with temporal masking during training, where visual information is randomly removed from recent timesteps to encourage temporal reasoning from historical context. Evaluation in simulation shows that STA consistently outperforms standard cross-attention and temporal modeling approaches such as TCN and LSTM networks across all tasks, achieving more than a 2x improvement over cross-attention on precision-critical tasks.
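The temporal masking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_recent_visual`, the parameters `max_masked` and `mask_prob`, the choice to zero out features (rather than, say, substituting a learned mask token), and the uniform sampling of how many recent timesteps to drop are all assumptions made for illustration.

```python
import numpy as np


def mask_recent_visual(visual_feats, max_masked=3, mask_prob=0.5, rng=None):
    """Randomly remove visual features from the most recent timesteps.

    visual_feats: array of shape (T, D), ordered oldest to newest.
    max_masked:   hypothetical cap on how many recent timesteps to drop.
    mask_prob:    hypothetical probability of applying masking to a sample.

    Returns (masked_feats, keep), where keep is a boolean array of shape (T,)
    that is False for the timesteps whose visual features were removed.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = visual_feats.shape[0]
    keep = np.ones(T, dtype=bool)

    # Always leave at least the oldest timestep visible.
    max_k = min(max_masked, T - 1)
    if max_k >= 1 and rng.random() < mask_prob:
        # Assumption: the number of masked steps is drawn uniformly from 1..max_k.
        k = int(rng.integers(1, max_k + 1))
        keep[T - k:] = False

    # Assumption: "removed" is modeled as zeroing the feature vector.
    masked_feats = visual_feats * keep[:, None]
    return masked_feats, keep


if __name__ == "__main__":
    history = np.random.default_rng(0).normal(size=(8, 4))
    masked, keep = mask_recent_visual(history, mask_prob=1.0)
    print(keep)
```

Under these assumptions, a policy trained on such samples cannot rely solely on the latest frames and has to infer the current situation from older context, which is the behavior the abstract attributes to temporal masking.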
