Masked Sensory-Temporal Attention for Sensor Generalization in Quadruped Locomotion

Dikai Liu; Tianwei Zhang; Jianxiong Yin; Simon See

TL;DR

The paper addresses the challenge of generalizing quadruped locomotion policies across different robot models and sensor configurations. It introduces Masked Sensory-Temporal Attention (MSTA), a transformer-based approach that tokenizes sensor data at the lowest level, applies random masking during training, and uses three embeddings (sensor type, channel, and time) to capture multimodal proprioception. Training uses a two-stage teacher-student transfer for robust learning. Key contributions include sensor-level tokenization, a masking strategy that promotes sensory-temporal understanding, and demonstrated robustness to missing or unseen data, including a successful zero-shot deployment at 150 Hz on a physical robot with a minimized observation set. The work shows that MSTA can match or exceed baselines under incomplete data and supports flexible sensor configurations. It also provides a scalable foundation for integrating high-dimensional inputs in future extensions, potentially reducing the need for retraining across diverse quadrupeds.

Abstract

With the rising focus on quadrupeds, a generalized policy capable of handling different robot models and sensor inputs becomes highly beneficial. Although several methods have been proposed to address different morphologies, it remains a challenge for learning-based policies to manage various combinations of proprioceptive information. This paper presents Masked Sensory-Temporal Attention (MSTA), a novel transformer-based mechanism with masking for quadruped locomotion. It employs direct sensor-level attention to enhance the sensory-temporal understanding and handle different combinations of sensor data, serving as a foundation for incorporating unseen information. MSTA can effectively understand its states even with a large portion of missing information, and is flexible enough to be deployed on physical systems despite the long input sequence.
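The sensor-level tokenization, three-way embedding, and random masking summarized above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sensor layout, bin count, value range, embedding width, and the helper names `tokenize`, `make_tables`, `embed`, and `random_mask` are all assumptions made for illustration, and the embedding tables are random here where a real model would learn them. The learnable state token and the transformer itself are omitted.

```python
import numpy as np

# Hypothetical sensor layout: name -> number of channels (assumed, not from the paper).
SENSORS = {"joint_pos": 12, "joint_vel": 12, "imu_gyro": 3, "imu_gravity": 3}
NUM_BINS = 256   # assumed discretization resolution
EMBED_DIM = 32   # assumed embedding width


def tokenize(history, low=-1.0, high=1.0):
    """Turn a (T, C) array of normalized readings into one token per scalar.

    Returns four flat arrays of length T*C: value bin, sensor-type id,
    channel id within the sensor, and time-step id.
    """
    num_steps, num_chan = history.shape
    scaled = np.floor((history - low) / (high - low) * NUM_BINS).astype(int)
    bins = np.clip(scaled, 0, NUM_BINS - 1)
    type_ids = np.concatenate([np.full(n, i) for i, n in enumerate(SENSORS.values())])
    chan_ids = np.concatenate([np.arange(n) for n in SENSORS.values()])
    time_ids = np.repeat(np.arange(num_steps), num_chan)
    return (bins.reshape(-1), np.tile(type_ids, num_steps),
            np.tile(chan_ids, num_steps), time_ids)


def make_tables(rng, num_steps):
    """Random stand-ins for embedding tables that would normally be learned."""
    sizes = {"value": NUM_BINS, "type": len(SENSORS),
             "chan": max(SENSORS.values()), "time": num_steps}
    return {name: rng.normal(size=(n, EMBED_DIM)) for name, n in sizes.items()}


def embed(tokens, tables):
    """Sum the value embedding with sensor-type, channel, and time embeddings."""
    value, type_ids, chan_ids, time_ids = tokens
    return (tables["value"][value] + tables["type"][type_ids]
            + tables["chan"][chan_ids] + tables["time"][time_ids])


def random_mask(embedded, mask_ratio, rng):
    """Drop a random fraction of tokens; return the kept tokens and their indices."""
    num_tokens = embedded.shape[0]
    num_keep = max(1, int(round(num_tokens * (1.0 - mask_ratio))))
    keep = np.sort(rng.choice(num_tokens, size=num_keep, replace=False))
    return embedded[keep], keep
```

With a history of 5 time steps over the 30 assumed channels, `embed(tokenize(history), tables)` yields 150 tokens, and `random_mask(..., 0.5, rng)` keeps 75 of them. Because each token carries its own type, channel, and time embedding, the surviving tokens remain identifiable after masking, which is what lets a sequence model consume variable subsets of sensors.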

Paper Structure (14 sections, 5 equations, 7 figures, 1 table)

Introduction
Related Work
Sim-to-Real Policy Learning in Legged Locomotion
Transformer in Robotics
Preliminary
Simulation Environment
Teacher Policy and Training
Methodology
Experiments and Results
Impact of Mask Ratio
Comparison with Baselines
Generalization, Robustness and Flexibility
Physical Deployment
Conclusion

Figures (7)

Figure 1: Commonly seen low-level sensors on a quadrupedal robot. The actual sensor set still differs across models, and sensor degradation can make part of the sensor data unreliable or even unavailable. With MSTA, we create a generalized model that better understands sensor information and can handle variable sensor input for quadruped locomotion.
Figure 2: Overview of our MSTA. We gather proprioceptive information from commonly seen low-level sensors for discretization and tokenization. Similar to video understanding, we add embeddings along three dimensions: sensor type, sensor dimension, and time. Before the tokens are passed to the transformer, a random mask is applied to partially remove information, and a learnable state embedding <S> is used to consolidate the information for action prediction. The target joint position output is passed to the PD controller for direct joint control.
Figure 3: Heatmap matrix for the performance of models that are trained with different combinations of mask ratios. The three rows from top to bottom represent the linear velocity tracking, angular velocity tracking and total reward return respectively. The four columns denote different masking ratios applied during testing. For each sub-figure, the y-axis is the masking ratio applied during the offline pretraining stage and the x-axis is the masking ratio applied during the online correction stage.
Figure 4: Performance with certain sensory feedback completely removed.
Figure 5: Performance with various setups. Left: a certain number of joint encoders are masked out. Right: different history time windows $T$ are applied.
...and 2 more figures
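
The Figure 2 caption notes that the policy's target joint positions are passed to a PD controller. As a rough illustration of that last step, here is a sketch of the standard joint-space PD law with a zero target velocity. The gains `kp` and `kd` and the function name `pd_torque` are hypothetical; the paper's actual gains and controller details are not given on this page.

```python
import numpy as np


def pd_torque(q_target, q, qd, kp=20.0, kd=0.5):
    """Joint-space PD law: tau = kp * (q_target - q) - kd * qd.

    q_target: target joint positions from the policy (rad)
    q, qd:    measured joint positions (rad) and velocities (rad/s)
    kp, kd:   assumed gains, chosen only for illustration
    """
    q_target, q, qd = (np.asarray(a, dtype=float) for a in (q_target, q, qd))
    return kp * (q_target - q) - kd * qd
```

The proportional term pulls each joint toward the commanded position, while the derivative term damps joint velocity.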