Social-MAE: Social Masked Autoencoder for Multi-person Motion Representation Learning
Mahsa Ehsanpour, Ian Reid, Hamid Rezatofighi
TL;DR
Social-MAE introduces a transformer-based masked autoencoder that pre-trains on multi-person joint trajectories in the frequency domain using a sparse tube-masking strategy. By reconstructing masked trajectory tokens with a dedicated MAE encoder and a lightweight decoder, the model learns generalizable motion representations that transfer to downstream high-level social tasks such as multi-person pose forecasting, social grouping, and social action understanding. The approach achieves state-of-the-art results across four datasets and multiple tasks, demonstrating strong data efficiency and robustness even when fine-tuning from limited annotated data. This work provides a principled self-supervised pre-training route for pose-dependent social reasoning without relying on large-scale labeled video data.
Abstract
Fully understanding multi-person scenes requires going beyond basic tasks such as detection and tracking. Higher-level tasks, such as understanding the interactions and social activities among individuals, are also crucial. Progress towards models that can fully understand scenes involving multiple people is hindered by a lack of sufficient annotated data for such high-level tasks. To address this challenge, we introduce Social-MAE, a simple yet effective transformer-based masked autoencoder framework for multi-person human motion data. The framework uses masked modeling to pre-train the encoder to reconstruct masked human joint trajectories, enabling it to learn generalizable and data-efficient representations of motion in crowded human scenes. Social-MAE operates on multi-person joint trajectories in the frequency domain and comprises a transformer as the MAE encoder and a lighter-weight transformer as the MAE decoder. After pre-training on the reconstruction task, the MAE decoder is replaced with a task-specific decoder and the model is fine-tuned end-to-end for a variety of high-level social tasks. Our proposed model, combined with our pre-training approach, achieves state-of-the-art results on various high-level social tasks, including multi-person pose forecasting, social grouping, and social action understanding. These improvements are demonstrated across four popular multi-person datasets encompassing both 2D and 3D human body pose.
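To make the input pipeline described above more concrete, the following is a minimal sketch of how joint trajectories might be turned into frequency-domain tokens and masked before encoding. It is not the authors' implementation. The abstract says only that the model works "in the frequency domain"; the use of a DCT, the tensor layout `(persons, frames, joints, dims)`, the one-token-per-joint-trajectory design, and the function names `dct_matrix`, `trajectories_to_tokens`, and `random_token_mask` are all assumptions made for illustration.

```python
import numpy as np


def dct_matrix(t: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (t, t); row k is the k-th frequency."""
    n = np.arange(t)
    k = n.reshape(-1, 1)
    basis = np.cos(np.pi * (2 * n + 1) * k / (2 * t))
    basis[0] *= np.sqrt(1.0 / t)
    basis[1:] *= np.sqrt(2.0 / t)
    return basis


def trajectories_to_tokens(poses: np.ndarray) -> np.ndarray:
    """Map poses (P, T, J, D) to one frequency-domain token per joint trajectory.

    Assumed layout: P persons, T frames, J joints, D coordinates (2 or 3).
    Returns an array of shape (P * J, T * D).
    """
    p, t, j, d = poses.shape
    traj = poses.transpose(0, 2, 1, 3)  # (P, J, T, D)
    # Apply the DCT along the time axis of every joint trajectory.
    coeffs = np.einsum("kt,pjtd->pjkd", dct_matrix(t), traj)
    return coeffs.reshape(p * j, t * d)


def random_token_mask(num_tokens: int, mask_ratio: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Boolean mask with True for tokens hidden from the encoder."""
    num_masked = int(num_tokens * mask_ratio)
    mask = np.zeros(num_tokens, dtype=bool)
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask


rng = np.random.default_rng(0)
poses = rng.normal(size=(3, 16, 13, 3))      # hypothetical scene: 3 people, 16 frames
tokens = trajectories_to_tokens(poses)
mask = random_token_mask(len(tokens), mask_ratio=0.5, rng=rng)  # ratio is illustrative
visible_tokens = tokens[~mask]               # what an encoder would receive
```

Because each token here covers a joint's whole trajectory, masking a token hides that joint across all frames, which is one plausible reading of the tube masking mentioned in the TL;DR. In a full pipeline the encoder would see only `visible_tokens`, and the lightweight decoder would be trained to reconstruct the rows of `tokens` where `mask` is true.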
