Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition

Mingfang Zhang, Yifei Huang, Ruicong Liu, Yoichi Sato

TL;DR

To address egocentric action recognition from synchronized video and body-worn IMU signals, the paper proposes EgoVideoIMU-MAE (EVI-MAE), which combines a multimodal Masked Autoencoder (MAE) with a graph-based IMU representation to learn aligned, robust features from unlabeled data. Pretraining uses two branches that take the video input $D_v \in \mathbb{R}^{T \times \mathcal{S}_v \times H \times W \times 3}$ and the raw IMU input $D_{raw} \in \mathbb{R}^{N_{imu} \times T \times \mathcal{S}_{imu} \times 3}$, and optimizes a joint loss $L = \alpha L_{mse} + \beta L_{cos} + \gamma L_{con}$ to capture cross-modal correlations. Finetuning for action classification with the concatenated encoders yields state-of-the-art results on CMU-MMAC and WEAR and remains robust when some IMU devices are missing or the video quality is degraded, which supports practical use in real-world deployments. The work advances multimodal representation learning by combining MAE with a graph-based IMU model for flexible, resilient egocentric action recognition.
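The joint loss above is a weighted sum of three terms. The following is a minimal NumPy sketch of how such a sum could be assembled. It assumes, without confirmation from this summary, that $L_{mse}$ is a reconstruction error over masked patches, $L_{cos}$ is a cosine distance between predicted and target IMU features, and $L_{con}$ is a symmetric InfoNCE-style contrastive term between paired video and IMU embeddings. All function names, the temperature, and the default weights are hypothetical.

```python
import numpy as np

def mse_loss(pred, target, mask):
    """Mean squared error averaged over masked patches only (mask == 1)."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return float((per_patch * mask).sum() / max(mask.sum(), 1))

def cosine_loss(pred, target, eps=1e-8):
    """One minus cosine similarity, averaged over rows."""
    num = (pred * target).sum(axis=-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1) + eps
    return float((1.0 - num / den).mean())

def contrastive_loss(z_v, z_i, temperature=0.07):
    """Symmetric InfoNCE: row k of z_v is the positive pair of row k of z_i."""
    z_v = z_v / np.linalg.norm(z_v, axis=-1, keepdims=True)
    z_i = z_i / np.linalg.norm(z_i, axis=-1, keepdims=True)
    logits = z_v @ z_i.T / temperature

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return float(0.5 * (diagonal_cross_entropy(logits)
                        + diagonal_cross_entropy(logits.T)))

def joint_loss(l_mse, l_cos, l_con, alpha=1.0, beta=1.0, gamma=0.01):
    """L = alpha * L_mse + beta * L_cos + gamma * L_con (weights are placeholders)."""
    return alpha * l_mse + beta * l_cos + gamma * l_con
```

In a real training loop the three terms would be computed on framework tensors so that gradients flow; the NumPy version only illustrates the arithmetic of the weighted sum.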

Abstract

Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable for egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video. Due to the scarcity of labeled multimodal data, we design an MAE-based self-supervised pretraining method that obtains strong multimodal representations by modeling the natural correlation between visual and motion signals. To model the complex relations among multiple IMU devices placed across the body, we exploit their collaborative dynamics and propose to embed the relative motion features of human joints into a graph structure. Experiments show that our method achieves state-of-the-art performance on multiple public datasets. The effectiveness of our MAE-based pretraining and graph-based IMU modeling is further validated by experiments in more challenging scenarios, including partially missing IMU devices and video quality corruption, promoting more flexible usage in the real world.
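As a rough illustration of embedding relative motion into a graph, the sketch below treats each IMU device as a node and uses pairwise signal differences as edge features. This is only an assumption made for illustration: the summary does not specify how the paper builds its graph, and the function name `build_imu_graph`, the fully connected topology, and the `present` flag for missing devices are all hypothetical.

```python
import numpy as np

def build_imu_graph(d_raw, present=None):
    """Build a fully connected graph over IMU devices.

    d_raw:   array of shape (N_imu, T, S_imu, 3), one signal block per device.
    present: optional boolean vector of length N_imu; False marks a missing device.
    Returns (adj, rel), where adj is an (N_imu, N_imu) boolean adjacency matrix
    and rel[i, j] = d_raw[i] - d_raw[j] is the relative motion on edge (i, j).
    """
    n_imu = d_raw.shape[0]
    if present is None:
        present = np.ones(n_imu, dtype=bool)
    present = np.asarray(present, dtype=bool)

    # Pairwise differences via broadcasting: shape (N_imu, N_imu, T, S_imu, 3).
    rel = d_raw[:, None] - d_raw[None, :]

    # Connect every pair of distinct devices that are both present.
    adj = np.outer(present, present) & ~np.eye(n_imu, dtype=bool)

    # Zero the features of edges that do not exist (self-loops, missing devices).
    rel = rel * adj[:, :, None, None, None]
    return adj, rel
```

Passing `present=[True, True, False, True]` removes every edge that touches the third device, which mimics the partially missing IMU scenario mentioned in the abstract.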

Paper Structure (33 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of our EgoVideoIMU-MAE (EVI-MAE). Because of the scarcity of labeled multimodal data, we propose an MAE-based pretraining approach with unlabeled egocentric video and IMU signals. To exploit the collaborative dynamics in multiple IMU devices, we propose to embed the relative motion features of human joints into a graph. In the finetuning and evaluation phase, we consider potential video and IMU corruption for more flexible usages.
  • Figure 2: IMU and video data preprocessing and masking.
  • Figure 3: Our EVI-MAE pretraining network processes video patches $\boldsymbol{v}$ and IMU spectrogram patches $\boldsymbol{i}$ and incorporates them into two branches, a multimodal pixel reconstruction branch, and an IMU feature reconstruction branch.
  • Figure 4: Visual degradation challenge. We employ a sophisticated method to synthesize a low-light effect and degrade the signal-to-noise ratio of the video input. In such cases, our multimodal model shifts its focus to the IMU modality for robust performance. In contrast, simple multimodal feature concatenation (W+, Bock et al., 2023) achieves suboptimal results.
  • Figure 5: Visualized instances in which our multimodal approach successfully recognizes actions, whereas the VideoMAE model (Tong et al., 2022) fails.