Learning Video Representations without Natural Videos

Xueyang Yu, Xinlei Chen, Yossi Gandelsman

TL;DR

This work proposes a progression of video datasets, synthesized by simple generative processes, that model a growing set of natural video properties. It also identifies correlations between frame diversity, frame similarity to natural data, and downstream performance.

Abstract

We show that useful video representations can be learned from synthetic videos and natural images, without incorporating natural videos in training. We propose a progression of video datasets, synthesized by simple generative processes, that model a growing set of natural video properties (e.g., motion, acceleration, and shape transformations). The downstream performance of video models pre-trained on these generated datasets gradually increases with the dataset progression. A VideoMAE model pre-trained on our synthetic videos closes 97.2% of the performance gap on UCF101 action classification between training from scratch and self-supervised pre-training from natural videos, and outperforms the pre-trained model on HMDB51. Introducing crops of static images to the pre-training stage yields performance similar to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance. Our approach provides a more controllable and transparent alternative to video data curation processes for pre-training.
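The "closes 97.2% of the performance gap" figure can be read as a ratio of accuracy differences. The sketch below shows one plausible way to compute such a ratio; the function name `gap_closed`, its parameter names, and the example accuracies are illustrative assumptions, not values or code from the paper.

```python
def gap_closed(scratch_acc: float, candidate_acc: float, reference_acc: float) -> float:
    """Fraction of the scratch-to-reference accuracy gap recovered by the candidate.

    scratch_acc:   accuracy when training from scratch (no pre-training)
    candidate_acc: accuracy after pre-training on the candidate (e.g. synthetic) data
    reference_acc: accuracy after pre-training on the reference (e.g. natural) data
    """
    gap = reference_acc - scratch_acc
    if gap == 0:
        raise ValueError("scratch and reference accuracies are equal; the gap is undefined")
    return (candidate_acc - scratch_acc) / gap


# Placeholder accuracies chosen for easy arithmetic; NOT results from the paper.
example = gap_closed(scratch_acc=50.0, candidate_acc=80.0, reference_acc=90.0)
print(f"{example:.1%}")  # prints 75.0%
```

Under this reading, a value above 1.0 would mean the candidate outperforms the reference, which matches how the abstract describes the HMDB51 result.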

Paper Structure

This paper contains 24 sections, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Samples from our progression of video generation models and additionally included image datasets. We present 4 frames from timestamps $t \in \{0,10,20,30\}$ of a randomly sampled video from each of our generated datasets, and UCF101 (left to right).
  • Figure 2: Action recognition accuracy on UCF101. We present the UCF101 classification accuracy of the progression of models $\{M_i\}$, after fine-tuning each of them on UCF101. The accuracy increases along the progression.
  • Figure 3: Distribution shift results on UCF101-P [robustness2022large] (ViT-B). The last model in our progression outperforms pre-training on natural videos for 11 out of 14 corruption datasets.
  • Figure 4: Dataset properties compared to downstream performance. We compare the downstream classification accuracy on UCF101 after fine-tuning to frame and video properties of all the dataset variants we used in our analysis (see the dataset list in the appendix).
  • Figure 5: Feature visualizations for pre-trained models. We present the 3 principal components of the attention keys of the last encoder layer, for all $M_i$ as the three color channels. Different object parts start to appear as the datasets progress.
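The Figure 5 caption describes mapping the top 3 principal components of attention keys to color channels. A minimal sketch of that kind of visualization follows, assuming the keys for one frame are available as a `(tokens, dim)` array laid out on a spatial grid. The function name `keys_to_rgb`, the 14×14 grid, the per-channel min-max scaling, and the random stand-in data are all assumptions for illustration, not the authors' code.

```python
import numpy as np


def keys_to_rgb(keys: np.ndarray, grid_hw: tuple[int, int]) -> np.ndarray:
    """Project per-token key vectors onto their top 3 principal components
    and rescale each component to [0, 1] so it can be shown as an RGB image."""
    h, w = grid_hw
    if keys.shape[0] != h * w:
        raise ValueError("number of tokens must equal h * w")
    # Center the features so the SVD yields principal directions.
    centered = keys - keys.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T  # shape: (tokens, 3)
    # Min-max normalize each component independently (an assumed choice).
    lo = proj.min(axis=0, keepdims=True)
    hi = proj.max(axis=0, keepdims=True)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(h, w, 3)


# Random stand-in for real attention keys (hypothetical 14x14 grid, 64-dim keys).
rng = np.random.default_rng(0)
fake_keys = rng.normal(size=(14 * 14, 64))
image = keys_to_rgb(fake_keys, (14, 14))
```

With real keys extracted from a model, `image` could be passed to an image viewer such as `matplotlib.pyplot.imshow`; how the keys are extracted depends on the model implementation and is not covered here.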