Social-MAE: Social Masked Autoencoder for Multi-person Motion Representation Learning

Mahsa Ehsanpour, Ian Reid, Hamid Rezatofighi

TL;DR

Social-MAE introduces a transformer-based masked autoencoder that pre-trains on multi-person joint trajectories in the frequency domain using a sparse tube-masking strategy. By reconstructing masked trajectory tokens with a dedicated MAE encoder and a lightweight decoder, the model learns generalizable motion representations that transfer to downstream high-level social tasks such as multi-person pose forecasting, social grouping, and social action understanding. The approach achieves state-of-the-art results across four datasets and multiple tasks, demonstrating strong data efficiency and robustness even when fine-tuning from limited annotated data. This work provides a principled self-supervised pre-training route for pose-dependent social reasoning without relying on large-scale labeled video data.

Abstract

For a complete comprehension of multi-person scenes, it is essential to go beyond basic tasks like detection and tracking. Higher-level tasks, such as understanding the interactions and social activities among individuals, are also crucial. Progress towards models that can fully understand scenes involving multiple people is hindered by a lack of sufficient annotated data for such high-level tasks. To address this challenge, we introduce Social-MAE, a simple yet effective transformer-based masked autoencoder framework for multi-person human motion data. The framework uses masked modeling to pre-train the encoder to reconstruct masked human joint trajectories, enabling it to learn generalizable and data-efficient representations of motion in crowded human scenes. Social-MAE comprises a transformer as the MAE encoder and a lighter-weight transformer as the MAE decoder, and operates on multi-person joint trajectories in the frequency domain. After the reconstruction task, the MAE decoder is replaced with a task-specific decoder and the model is fine-tuned end-to-end for a variety of high-level social tasks. Our proposed model, combined with our pre-training approach, achieves state-of-the-art results on various high-level social tasks, including multi-person pose forecasting, social grouping, and social action understanding. These improvements are demonstrated across four popular multi-person datasets encompassing both 2D and 3D human body pose.
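To make the "joint trajectories in the frequency domain" idea concrete, here is a minimal sketch of how per-joint trajectories could be turned into frequency-domain tokens. It assumes a discrete cosine transform (DCT) along the time axis and one token per (person, joint) pair; the abstract does not specify the transform or the token layout, so both are assumptions made for illustration, and the function names `dct_matrix`, `trajectories_to_tokens`, and `tokens_to_trajectories` are hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (n, n); row k is the k-th frequency."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] *= np.sqrt(1.0 / n)
    basis[1:] *= np.sqrt(2.0 / n)
    return basis


def trajectories_to_tokens(poses: np.ndarray) -> np.ndarray:
    """Map poses (people, frames, joints, dims) to tokens (people*joints, frames*dims).

    Assumption: each token is the DCT of one joint's trajectory for one person.
    """
    p, t, j, d = poses.shape
    traj = poses.transpose(0, 2, 1, 3)  # (people, joints, frames, dims)
    coeffs = np.einsum("kt,pjtd->pjkd", dct_matrix(t), traj)
    return coeffs.reshape(p * j, t * d)


def tokens_to_trajectories(tokens: np.ndarray, people: int, joints: int, dims: int) -> np.ndarray:
    """Inverse of trajectories_to_tokens (the orthonormal DCT is inverted by its transpose)."""
    frames = tokens.shape[1] // dims
    coeffs = tokens.reshape(people, joints, frames, dims)
    traj = np.einsum("kt,pjkd->pjtd", dct_matrix(frames), coeffs)
    return traj.transpose(0, 2, 1, 3)  # back to (people, frames, joints, dims)


# Toy scene: 3 people, 16 frames, 13 joints, 3D coordinates (all sizes are made up).
rng = np.random.default_rng(0)
poses = rng.normal(size=(3, 16, 13, 3))
tokens = trajectories_to_tokens(poses)
recovered = tokens_to_trajectories(tokens, people=3, joints=13, dims=3)
```

Because the transform is orthonormal, the mapping is lossless, so a reconstruction loss computed on the tokens is equivalent in magnitude to one computed on the raw trajectories under this assumed layout.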

Paper Structure (12 sections, 6 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: On the left, the input tokens to Social-MAE are the joint trajectories of all people in the scene. We randomly mask a subset of the input tokens. The encoder operates on a small subset of the input tokens (the unmasked tokens only). The small decoder is then applied to the latent representations of the unmasked tokens together with the masked tokens to reconstruct the original input joint trajectories. On the right, after pre-training, the decoder is replaced with a task-specific decoder for each downstream task and the pre-trained encoder is applied to the full set of multi-person joint trajectories.
  • Figure 2: Ablation analysis of the error (VIM) as a function of the number of available annotated data splits. We examine two scenarios: (1) training our baseline model from scratch (S), and (2) fine-tuning our S-MAE model (P). Note that S-MAE is pre-trained on all the data splits in a self-supervised manner. VIM is reported at each timestep on the SoMoF validation set. Overall is the average of the reported VIMs across timesteps.
  • Figure 3: Social grouping and action understanding on the JRDB-Act validation set. The top sub-figure shows the S-MAE predictions and the bottom sub-figure shows the ground truth. Social groups are indicated by bounding box colors and action labels are indicated by circles on the left side of each box. Note that the same social group may have different colors in the prediction and ground-truth sub-figures.
  • Figure 4: Three different samples of social grouping and action understanding on the JRDB-Act validation set. In each sample, the top sub-figure shows the S-MAE predictions and the bottom sub-figure shows the ground truth. Social groups are indicated by bounding box colors and action labels are indicated by circles on the left side of each box. Note that the same social group may have different colors in the prediction and ground-truth sub-figures.
  • Figure 5: A failure case of S-MAE in the social grouping task. Since S-MAE relies on multi-person joint trajectories to predict social groups, it assigns people who have similar joint movements and are close to each other in 2D to the same social group.
  • ...and 1 more figure
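The pre-training data flow described in the Figure 1 caption (mask a random subset of tokens, encode only the visible ones, then decode visible latents plus mask tokens to reconstruct the input) can be sketched as follows. This is a toy illustration, not the authors' implementation: single linear maps stand in for the transformer encoder and decoder, positional embeddings are omitted, the names `random_masking` and `mae_forward` are hypothetical, and computing the loss on masked tokens only follows the common MAE convention rather than anything stated in this summary.

```python
import numpy as np


def random_masking(num_tokens: int, mask_ratio: float, rng: np.random.Generator):
    """Return (keep_idx, mask_idx): sorted indices of visible and masked tokens."""
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    perm = rng.permutation(num_tokens)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])


def mae_forward(tokens, keep_idx, mask_idx, w_enc, w_dec, mask_token):
    """One forward pass of the masked-reconstruction objective on a single scene."""
    # Encoder stand-in: processes the visible (unmasked) tokens only.
    latent_visible = np.tanh(tokens[keep_idx] @ w_enc)

    # Rebuild the full-length sequence: shared mask token at masked positions,
    # encoder output at visible positions.
    full = np.tile(mask_token, (tokens.shape[0], 1))
    full[keep_idx] = latent_visible

    # Decoder stand-in: maps every position back to token space.
    recon = full @ w_dec

    # Assumed objective: mean squared error on the masked positions only.
    loss = float(np.mean((recon[mask_idx] - tokens[mask_idx]) ** 2))
    return recon, loss


# Toy sizes (made up): 40 trajectory tokens of width 48, latent width 32.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(40, 48))
w_enc = rng.normal(size=(48, 32)) * 0.1
w_dec = rng.normal(size=(32, 48)) * 0.1
mask_token = np.zeros(32)

keep_idx, mask_idx = random_masking(num_tokens=40, mask_ratio=0.75, rng=rng)
recon, loss = mae_forward(tokens, keep_idx, mask_idx, w_enc, w_dec, mask_token)
```

For fine-tuning as described on the right of Figure 1, the masking step would be skipped (all tokens passed to the encoder) and the decoder stand-in swapped for a task-specific head.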