Data Collection-free Masked Video Modeling

Yuchi Ishikawa; Masayoshi Kondo; Yoshimitsu Aoki

TL;DR

This work tackles the data, privacy, and licensing challenges of pre-training video transformers by introducing a self-supervised framework that generates pseudo-motion videos from static images using a Pseudo Motion Generator (PMG). The PMG-produced videos feed masked video modeling via VideoMAE, enabling effective spatio-temporal representation learning without real video data, and can also augment real-video pre-training. Across ablations and benchmarks, the approach outperforms static-image baselines and partially rivals methods using real or synthetic videos, while reducing data collection concerns. The findings shed light on what video transformers learn from masked video modeling and highlight the value of motion diversity and low-level feature learning for transferability.

Abstract

Pre-training video transformers generally requires a large amount of data, presenting significant challenges in terms of data collection costs and concerns related to privacy, licensing, and inherent biases. Synthesizing data is one of the promising ways to solve these issues, yet pre-training solely on synthetic data has its own challenges. In this paper, we introduce an effective self-supervised learning framework for videos that leverages readily available and less costly static images. Specifically, we define the Pseudo Motion Generator (PMG) module that recursively applies image transformations to generate pseudo-motion videos from images. These pseudo-motion videos are then leveraged in masked video modeling. Our approach is applicable to synthetic images as well, thus entirely freeing video pre-training from data collection costs and other concerns in real data. Through experiments in action recognition tasks, we demonstrate that this framework allows effective learning of spatio-temporal features through pseudo-motion videos, significantly improving over existing methods which also use static images and partially outperforming those using both real and synthetic videos. These results uncover fragments of what video transformers learn through masked video modeling.
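
The recursive idea behind the PMG can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `pseudo_motion_video` and `shift_right` are hypothetical, images are plain nested lists of grayscale values, and a cyclic horizontal shift stands in for whatever image transformations the paper actually uses. The point it shows is that each frame is produced by applying a transformation to the previous frame, so a single static image unrolls into a clip with consistent apparent motion.

```python
from typing import Callable, List

# Assumption: a grayscale image is a list of rows of pixel values.
Image = List[List[int]]


def shift_right(image: Image, pixels: int = 1) -> Image:
    """Hypothetical stand-in transform: cyclically shift every row to the right."""
    width = len(image[0])
    k = pixels % width
    # Guard k == 0, because row[-0:] would return the whole row.
    return [(row[-k:] + row[:-k]) if k else list(row) for row in image]


def pseudo_motion_video(
    image: Image,
    transform: Callable[[Image], Image],
    num_frames: int,
) -> List[Image]:
    """Build a clip by recursively applying `transform` to the previous frame."""
    frames = [image]
    for _ in range(num_frames - 1):
        # Frame t is the transform of frame t-1, not of the original image,
        # so the apparent motion accumulates over time.
        frames.append(transform(frames[-1]))
    return frames


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8]]
    video = pseudo_motion_video(image, shift_right, num_frames=4)
    for t, frame in enumerate(video):
        print(t, frame)
```

In a real pipeline the frames would presumably be image tensors and the resulting clip would be passed to the masked video modeling stage; those details are outside what this sketch covers.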

Paper Structure (34 sections, 2 equations, 10 figures, 16 tables)

Introduction
Related Work
Proposed Method
Overview of Our Self-supervised Framework
Pseudo Motion Generator (PMG)
Combination of Our Framework with Synthetic Images
Experiments
Ablation Studies
The effect of image augmentations.
The combination of image augmentations.
The efficacy of video-level augmentations.
Transferability of Our Framework
Transferability from other video datasets.
Transferability from real image datasets.
Transferability from synthetic images.
...and 19 more sections

Figures (10)

Figure 1: Overview of our proposed framework.
Figure 2: Examples of pseudo-motion videos. Images are sampled from PASS (Asano et al., 2021). For more examples of pseudo-motion videos, see the supplementary material.
Figure 3: Effect of the number of epochs, data, and categories.
Figure 4: Performance when the number of video data for finetuning is limited.
Figure 5: Python pseudo-code for Pseudo Motion Generator (PMG).
...and 5 more figures
