MV2MAE: Multi-View Video Masked Autoencoders

Ketul Shah, Robert Crandall, Jie Xu, Peng Zhou, Marian George, Mayank Bansal, Rama Chellappa

TL;DR

MV2MAE introduces a multi-view masked autoencoder framework that leverages synchronized multi-view videos to learn geometry-aware, view-invariant representations. It adds a cross-view decoder with cross-attention to reconstruct a target viewpoint from a source view, while a motion-weighted reconstruction loss emphasizes learning from moving regions to improve temporal modeling. The approach achieves state-of-the-art results on NTU-60, NTU-120, and ETRI in pre-training plus finetuning and demonstrates strong transfer learning performance on NUCLA, PKU-MMD-II, and ROCOG-v2, highlighting robustness to viewpoint changes. This work advances self-supervised video learning by fusing cross-view geometry with efficient MAE-based pretraining, offering practical benefits for multi-view action recognition and related tasks.

Abstract

Videos captured from multiple viewpoints can help in perceiving the 3D structure of the world and benefit computer vision tasks such as action recognition and tracking. In this paper, we present a method for self-supervised learning from synchronized multi-view videos. We use a cross-view reconstruction task to inject geometry information into the model. Our approach is based on the masked autoencoder (MAE) framework. In addition to the same-view decoder, we introduce a separate cross-view decoder which leverages a cross-attention mechanism to reconstruct a target viewpoint video using a video from a source viewpoint, which helps make the representations robust to viewpoint changes. For videos, static regions can be reconstructed trivially, which hinders learning meaningful representations. To tackle this, we introduce a motion-weighted reconstruction loss which improves temporal modeling. We report state-of-the-art results on the NTU-60, NTU-120 and ETRI datasets, as well as in the transfer learning setting on the NUCLA, PKU-MMD-II and ROCOG-v2 datasets, demonstrating the robustness of our approach. Code will be made available.
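
To make the idea of a motion-weighted reconstruction loss concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the motion proxy (mean absolute frame difference per patch), the softmax over patches, and the names `motion_weights` and `motion_weighted_mse` are assumptions chosen for illustration. The only property taken from the paper summary is that a temperature controls how much weight falls on static versus moving regions.

```python
import numpy as np

def motion_weights(patches, temperature=1.0):
    """Per-patch weights from frame differences (assumed motion proxy).

    patches: (T, N, D) array with T frames, N patches per frame, D pixels per patch.
    Returns a (T, N) array whose rows each sum to 1.
    """
    # Mean absolute difference to the previous frame, per patch: (T-1, N)
    diff = np.abs(np.diff(patches, axis=0)).mean(axis=-1)
    # Frame 0 has no predecessor, so reuse the first difference for it: (T, N)
    motion = np.concatenate([diff[:1], diff], axis=0)
    # Softmax over the patches of each frame; a higher temperature flattens
    # the distribution and therefore puts more weight on static patches
    logits = motion / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def motion_weighted_mse(pred, target, mask, temperature=1.0):
    """Weighted MSE over masked patches only.

    pred, target: (T, N, D) arrays; mask: (T, N) boolean, True where a patch was masked.
    """
    w = motion_weights(target, temperature) * mask
    per_patch = ((pred - target) ** 2).mean(axis=-1)  # (T, N)
    return float((w * per_patch).sum() / w.sum())
```

With a low temperature the loss is dominated by the patches that change most between frames; with a high temperature it approaches a uniform average over all masked patches.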

Paper Structure

This paper contains 20 sections, 2 equations, 6 figures, 17 tables, 1 algorithm.

Figures (6)

  • Figure 1: Multi-View Video Masked Autoencoder (MV2MAE). Source and target viewpoint videos are tokenized, the tokens are masked using random masking, and visible tokens are encoded for each view using a shared encoder. The cross-view decoder uses source view tokens to reconstruct the target viewpoint, whereas the standard decoder reconstructs each view separately.
  • Figure 2: Motion weights. Each row shows motion weights overlaid on the input frames for different temperature values. Higher temperature increases the weight on static regions.
  • Figure 3: The temperature parameter of the motion weights modulates the focus on static vs. moving regions, as visualized in Figure 2.
  • Figure 4: Cross-View Decoder Qualitative Analysis. We visualize the reconstructions and cross-attention maps from the cross-view decoder. First row: target viewpoint input frames. Second row: masked input frames from the target viewpoint. Third row: reconstruction of the target view from the cross-view decoder. Last row: cross-attention maps visualized on source view frames. The red circle indicates the query token whose attention maps are visualized. Green circles show that the model is able to find matching regions across viewpoints.
  • Figure 5: Example of synced views from ETRI dataset.
  • ...and 1 more figure
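
The pipeline described in the Figure 1 caption (random masking of tokens, then a cross-view decoder that reconstructs the target view from source-view tokens) can be sketched roughly as follows. This is an illustrative NumPy toy, not the paper's architecture: the single attention head, the random projection matrices, the 0.75 mask ratio, the omission of positional embeddings, and the names `random_mask` and `cross_attention` are all assumptions.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted indices of visible tokens and a boolean mask (True = masked)."""
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    keep = np.sort(rng.permutation(num_tokens)[:num_keep])
    mask = np.ones(num_tokens, dtype=bool)
    mask[keep] = False
    return keep, mask

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (Nq, D) target-view tokens; context: (Nc, D) source-view tokens.
    Returns the attended output (Nq, D) and the attention map (Nq, Nc).
    """
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v, attn

rng = np.random.default_rng(0)
N, D = 16, 8  # tokens per view and embedding width (toy sizes)

# Each view is masked independently
keep_src, _ = random_mask(N, 0.75, rng)
keep_tgt, mask_tgt = random_mask(N, 0.75, rng)

# Stand-ins for the shared encoder's outputs on the visible tokens of each view
src_tokens = rng.normal(size=(len(keep_src), D))
tgt_visible = rng.normal(size=(len(keep_tgt), D))

# Target queries: a mask token at every masked position, encoded tokens elsewhere
mask_token = rng.normal(size=D)  # would be a learned parameter in a real model
tgt_queries = np.tile(mask_token, (N, 1))
tgt_queries[keep_tgt] = tgt_visible

# Random projections stand in for learned weights
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
out, attn = cross_attention(tgt_queries, src_tokens, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over source-view tokens for one target-view position, which is the kind of map the Figure 4 caption describes visualizing on the source frames.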