Masked Sensory-Temporal Attention for Sensor Generalization in Quadruped Locomotion
Dikai Liu, Tianwei Zhang, Jianxiong Yin, Simon See
TL;DR
The paper addresses the challenge of generalizing quadruped locomotion policies across different robot models and sensor configurations. It introduces Masked Sensory-Temporal Attention (MSTA), a transformer-based approach that tokenizes sensor data at the lowest level, applies random masking during training, and uses three embeddings (sensor type, channel, and time) to capture multimodal proprioception. A two-stage teacher-student transfer is used for robust learning. Key contributions include sensor-level tokenization, a masking strategy that promotes sensory-temporal understanding, and demonstrated robustness to missing or unseen data, including a successful zero-shot deployment at 150 Hz on a physical robot with minimized observations. The work shows that MSTA can match or exceed baselines under incomplete data and supports flexible sensor configurations. It also provides a scalable foundation for integrating high-dimensional inputs in future work, potentially reducing the need for retraining across diverse quadrupeds.
Abstract
With the rising focus on quadrupeds, a generalized policy capable of handling different robot models and sensor inputs becomes highly beneficial. Although several methods have been proposed to address different morphologies, it remains a challenge for learning-based policies to manage various combinations of proprioceptive information. This paper presents Masked Sensory-Temporal Attention (MSTA), a novel transformer-based mechanism with masking for quadruped locomotion. It employs direct sensor-level attention to enhance the sensory-temporal understanding and handle different combinations of sensor data, serving as a foundation for incorporating unseen information. MSTA can effectively understand its states even with a large portion of missing information, and is flexible enough to be deployed on physical systems despite the long input sequence.
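To make the idea of sensor-level tokenization with random masking more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each scalar reading becomes one token tagged with a sensor-type index, a channel index, and a time index (the three quantities the embeddings are described as encoding), and that masking simply drops tokens at random. The names `Token`, `tokenize`, `random_mask`, and the sensor names `joint_pos` and `imu_gyro` are hypothetical, chosen only for illustration; the paper may realize masking differently, for example through an attention mask.

```python
import random
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    """One scalar reading plus the indices an embedding layer would look up."""
    value: float     # raw scalar reading
    sensor_id: int   # index of the sensor type (hypothetical ordering)
    channel_id: int  # index of the channel within that sensor
    time_id: int     # index of the time step in the history window


def tokenize(history: List[Dict[str, List[float]]],
             sensor_names: List[str]) -> List[Token]:
    """Flatten a history of sensor frames into one token per scalar reading.

    A sensor that is absent from a frame contributes no tokens, so the
    sequence length varies with the sensor configuration.
    """
    tokens = []
    for t, frame in enumerate(history):
        for s, name in enumerate(sensor_names):
            for c, value in enumerate(frame.get(name, [])):
                tokens.append(Token(value, s, c, t))
    return tokens


def random_mask(tokens: List[Token], mask_ratio: float,
                rng: random.Random) -> List[Token]:
    """Drop each token independently with probability mask_ratio.

    Assumption: masking is modeled as token removal. At least one token
    is kept so that a downstream model never receives an empty sequence.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    kept = [tok for tok in tokens if rng.random() >= mask_ratio]
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return kept


# Two time steps; the second frame is missing the gyroscope entirely.
sensor_names = ["joint_pos", "imu_gyro"]
history = [
    {"joint_pos": [0.1, 0.2, 0.3], "imu_gyro": [0.0, 0.01, -0.02]},
    {"joint_pos": [0.11, 0.19, 0.31]},
]
tokens = tokenize(history, sensor_names)
masked = random_mask(tokens, mask_ratio=0.5, rng=random.Random(0))
```

Because every token carries its own sensor, channel, and time indices, the surviving tokens remain interpretable after masking or when a sensor is absent, which is the property that lets a single attention model accept different sensor combinations.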
