
VideoMAC: Video Masked Autoencoders Meet ConvNets

Gensheng Pei, Tao Chen, Xiruo Jiang, Huafeng Liu, Zeren Sun, Yazhou Yao

TL;DR

VideoMAC tackles the problem that masked video modeling (MVM) methods largely rely on resource-intensive ViTs and struggle with dense tasks. It proposes a ConvNet-based MVM framework that uses sparse convolutions, symmetric frame-pair masking, and an online–target EMA architecture with a reconstruction-consistency loss to enforce temporal coherence. The method demonstrates superior performance on video downstream tasks (e.g., video object segmentation, body part propagation, human pose tracking) and shows competitive image recognition after video pretraining, highlighting the viability of hierarchical ConvNets for MVM. This approach reduces computational cost while delivering strong, transferable video representations, suggesting a promising direction for ConvNet-based pre-training in video analysis.

Abstract

Recently, advances in self-supervised learning techniques such as masked autoencoders (MAE) have greatly influenced visual representation learning for images and videos. However, the predominant approaches to masked image / video modeling rely heavily on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper, we propose a new approach termed VideoMAC, which combines video masked autoencoders with resource-friendly ConvNets. Specifically, VideoMAC employs symmetric masking on randomly sampled pairs of video frames. To prevent mask pattern dissipation, we use ConvNets implemented with sparse convolutional operators as encoders. We also present a simple yet effective masked video modeling (MVM) approach: a dual-encoder architecture comprising an online encoder and an exponential moving average target encoder, designed to facilitate inter-frame reconstruction consistency in videos. Finally, we demonstrate that VideoMAC, which enables classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the benefits of MVM, outperforms ViT-based approaches on downstream tasks, including video object segmentation (+5.2% / 6.4% $\mathcal{J}\&\mathcal{F}$), body part propagation (+6.3% / 3.1% mIoU), and human pose tracking (+10.2% / 11.1% PCK@0.1).
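
The two mechanisms named in the abstract, symmetric masking of a frame pair and an exponential moving average (EMA) target encoder, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "symmetric" means both frames share the same masked positions, it uses scalar stand-ins for patches and weights, and the function names (`sample_symmetric_mask`, `apply_mask`, `ema_update`) and the momentum value are hypothetical.

```python
import random

def sample_symmetric_mask(num_patches, mask_ratio=0.75, seed=None):
    """Return one boolean mask (True = masked) shared by both frames of a pair."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

def apply_mask(patches, mask, fill=0.0):
    """Replace masked patches with a fill value (a stand-in for hiding them)."""
    return [fill if m else p for p, m in zip(patches, mask)]

def ema_update(target_params, online_params, momentum=0.99):
    """EMA step: target <- momentum * target + (1 - momentum) * online."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]

# Hypothetical toy data: 16 scalar "patches" per frame.
frame_past = [float(i) for i in range(16)]
frame_curr = [float(i) + 0.5 for i in range(16)]

# Assumption: the same mask is applied to both frames of the pair.
mask = sample_symmetric_mask(16, mask_ratio=0.75, seed=0)
masked_past = apply_mask(frame_past, mask)
masked_curr = apply_mask(frame_curr, mask)

# The target weights drift slowly toward the online weights (no gradients).
target = ema_update([0.0, 1.0], [1.0, 1.0], momentum=0.9)
```

In a real training loop the online encoder would be updated by gradient descent and `ema_update` would run once per step over all encoder parameters; the momentum schedule is left open here because the abstract does not specify it.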

Paper Structure (11 sections, 3 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: State-of-the-art MAE architectures (e.g., MAE-ST mae_st and SiamMAE gupta2023siamese) in masked video modeling commonly employ ViT-based encoders. We propose VideoMAC, a new video masked autoencoder built using pure ConvNets. In this study, we model VideoMAC with two of the most emblematic families of ConvNets, namely ConvNeXt V2 woo2023convnext and ResNet he2016deep. Notably, VideoMAC exhibits superior performance on a range of downstream tasks, e.g., video object segmentation on DAVIS17 davis17, body part propagation on VIP zhou2018adaptive, and human pose tracking on JHMDB jhuang2013towards, compared to ViT-based methods mae_st, gupta2023siamese.
  • Figure 2: Visualization of the heatmap during the reconstruction of masked patches in a random frame pair. It is evident that our approach highlights similar regions for both (a) past and (b) current frames, proficiently reconstructing colors and contours.
  • Figure 3: An illustration of VideoMAC for ConvNet-based MVM. During pre-training, we randomly mask 75% of the patches, symmetrically across the two frames. In VideoMAC, the MVM of frame pairs is achieved by an online network optimized by gradients ($\blacksquare$, online loss $\mathcal{L}_{o}$) and a target network updated by EMA ($\blacksquare$, target loss $\mathcal{L}_{t}$). $\mathcal{L}_{c}$ is the reconstruction consistency loss computed between the reconstructed patches of the two frames.
  • Figure 4: Qualitative results of our VideoMAC (using CNXv2-S) for three video downstream tasks: (a) video object segmentation on DAVIS17 davis17, (b) body part propagation on VIP zhou2018adaptive, and (c) human pose tracking on JHMDB jhuang2013towards.
  • Figure 5: For masked modeling, dense convolution usually results in the dissipation of mask structures. The deployment of sparse convolution proves to be an effective solution, enabling ConvNet-based MIM / MVM.
  • ...and 3 more figures
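
The mask dissipation described in the Figure 5 caption can be reproduced in a toy 1D setting. The sketch below is illustrative only: it emulates a sparse convolution by running a dense convolution and re-applying the visibility mask after every layer, which is a common approximation and not necessarily how VideoMAC implements its sparse operators. The helper names (`conv1d_same`, `masked_conv1d`) and the all-ones kernel are hypothetical.

```python
def conv1d_same(x, kernel):
    """Dense 1D convolution with zero padding (output has the input's length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]

def masked_conv1d(x, kernel, visible):
    """Emulated sparse convolution: dense result, then masked positions zeroed."""
    out = conv1d_same(x, kernel)
    return [v if keep else 0.0 for v, keep in zip(out, visible)]

# Only the first and last positions are visible; the six in between are masked.
visible = [True, False, False, False, False, False, False, True]
x = [1.0 if keep else 0.0 for keep in visible]
kernel = [1.0, 1.0, 1.0]

dense = x
sparse = x
for _ in range(3):  # three stacked layers
    dense = conv1d_same(dense, kernel)
    sparse = masked_conv1d(sparse, kernel, visible)

# Count masked positions that ended up with non-zero activations.
dense_leak = sum(1 for v, keep in zip(dense, visible) if not keep and v != 0.0)
sparse_leak = sum(1 for v, keep in zip(sparse, visible) if not keep and v != 0.0)
print(dense_leak, sparse_leak)  # prints "6 0"
```

After three dense layers every masked position carries signal leaked from the visible ones, so the mask pattern is no longer recoverable from the features, whereas the masked variant keeps all six masked positions at zero.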