MU-MAE: Multimodal Masked Autoencoders-Based One-Shot Learning

Rex Liu, Xin Liu

TL;DR

Mu-MAE tackles the challenge of one-shot multimodal human activity recognition by removing the need for external pretraining data. It combines multimodal masked autoencoders with synchronized masking across wearable sensors and a cross-attention fusion mechanism to produce a rich multimodal representation, $R^{m}$, used by a model-agnostic one-shot classifier. The approach achieves state-of-the-art performance on MMAct, notably 80.17% for five-way 1-shot without extra data and up to 83.82% with external data, while ablations confirm the importance of synchronized masking and cross-attention. This work reduces annotation costs and data dependencies, offering a scalable blueprint for reliable in-domain self-supervised pretraining in multimodal HAR systems.

Abstract

With the exponential growth of multimedia data, leveraging multimodal sensors is a promising approach for improving accuracy in human activity recognition. Nevertheless, accurately identifying these activities from both video data and wearable sensor data is challenging because of labor-intensive data annotation and reliance on external pretrained models or additional data. To address these challenges, we introduce Multimodal Masked Autoencoders-Based One-Shot Learning (Mu-MAE). Mu-MAE integrates a multimodal masked autoencoder with a synchronized masking strategy tailored for wearable sensors. This masking strategy compels the networks to capture more meaningful spatiotemporal features, which enables effective self-supervised pretraining without the need for external data. Furthermore, Mu-MAE uses the representation extracted from the multimodal masked autoencoders as prior information input to a cross-attention multimodal fusion layer. This fusion layer emphasizes the spatiotemporal features that require attention across modalities while highlighting differences from other classes, which aids class discrimination in metric-based one-shot learning. Comprehensive evaluations on MMAct one-shot classification show that Mu-MAE outperforms all the evaluated approaches, achieving up to 80.17% accuracy for five-way one-shot multimodal classification without the use of additional data.

Paper Structure

This paper contains 14 sections, 13 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Illustration of Multimodal Masked Autoencoders-Based One-Shot Learning (MU-MAE) with one video modality and two time-series modalities. The framework has two steps. In the first step, pretraining, a tube masking strategy inspired by the VideoMAE framework is applied to the video, and representations of the unmasked video data are extracted. Simultaneously, a synchronized masking strategy is applied to the two physical sensor modalities: all time series are masked at the same time points. The concatenated representation, including position information, is then fed into the encoder module to produce the multimodal encoder representation. Individual decoders are then trained for each modality with a mean squared error loss to reconstruct that modality's data. The second step is finetuning for one-shot multimodal classification. The unimodal feature encoders pretrained in the first step extract unimodal representations. These unimodal representations and the multimodal encoder representation are fed into the cross-attention multimodal fusion module, which produces the multimodal representation that is passed to the model-agnostic one-shot learning module for classification. More details can be found in the method section.
  • Figure 2: Cross Attention Multimodal Fusion Module. $R^m_{encoder}$ is the multimodal representation from multimodal masked autoencoders' encoder.
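The synchronized masking described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes masking happens per time step rather than per patch or token, and the function names, the 0.75 mask ratio, and the toy accelerometer and gyroscope readings are all hypothetical choices made for illustration. The point it shows is that one set of masked time indices is drawn once and reused for every sensor modality, so the visible steps line up across sensors.

```python
import random


def synchronized_mask(num_steps, mask_ratio, seed=None):
    """Draw ONE boolean mask over time steps, to be shared by all sensors.

    True means "masked". The ratio is a hypothetical hyperparameter.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(round(num_steps * mask_ratio))
    masked = set(rng.sample(range(num_steps), num_masked))
    return [t in masked for t in range(num_steps)]


def apply_mask(series, mask):
    """Keep only the unmasked time steps of one modality."""
    if len(series) != len(mask):
        raise ValueError("series and mask must cover the same time steps")
    return [reading for reading, is_masked in zip(series, mask) if not is_masked]


# Hypothetical toy data: two wearable sensors sampled at the same 8 time steps.
accel = [[0.1 * t, 0.2 * t, 0.3 * t] for t in range(8)]
gyro = [[1.0 * t, 2.0 * t, 3.0 * t] for t in range(8)]

# The same mask is applied to both modalities (the "synchronized" part).
mask = synchronized_mask(num_steps=8, mask_ratio=0.75, seed=0)
visible_steps = [t for t, is_masked in enumerate(mask) if not is_masked]
visible_accel = apply_mask(accel, mask)
visible_gyro = apply_mask(gyro, mask)
```

Drawing an independent mask per sensor would instead let the encoder recover a masked reading in one modality from the same time step in another, which is the shortcut the shared mask is meant to remove.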