Online pre-training with long-form videos

Itsuki Kato, Kodai Kamiya, Toru Tamaki

TL;DR

This work studies online pre-training for video understanding using continuous clips from long-form videos. It compares three self-supervised learning approaches (masked image modeling, contrastive learning, and teacher-student distillation), using unlabelled AVA-Actions for pre-training and HMDB51 and UCF101 for downstream evaluation. The key finding is that contrastive learning delivers the strongest downstream action-recognition performance, and that the choice of stride between sequential clips helps to mitigate overfitting. The results suggest that what is learned from long-form videos can transfer to short-video action recognition, and they offer guidance on clip-sampling strategies for online pre-training.
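To make the idea of sequential clips with a stride concrete, here is a minimal sketch of how clip frame indices might be drawn in temporal order from one long video. It is an illustration only, not the authors' sampling code: the function name `sequential_clip_indices` and its parameters are hypothetical, and `stride` is assumed to mean the offset in frames between the starts of consecutive clips, which this summary does not define.

```python
from typing import Iterator, List


def sequential_clip_indices(
    num_frames: int, clip_len: int, stride: int
) -> Iterator[List[int]]:
    """Yield frame indices of consecutive clips in temporal order.

    Assumption (not from the paper): `stride` is the offset, in frames,
    between the start of one clip and the start of the next.
    """
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    start = 0
    # Stop once a full clip no longer fits inside the video.
    while start + clip_len <= num_frames:
        yield list(range(start, start + clip_len))
        start += stride


# Hypothetical example: a 20-frame video, 8-frame clips, stride of 4 frames.
clips = list(sequential_clip_indices(num_frames=20, clip_len=8, stride=4))
```

Under this assumed definition, a stride smaller than the clip length makes consecutive clips overlap, while a stride equal to or larger than the clip length makes them disjoint.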

Abstract

In this study, we investigate the impact of online pre-training with continuous video clips. We examine three pre-training methods (masked image modeling, contrastive learning, and knowledge distillation) and assess their performance on downstream action recognition tasks. Online pre-training with contrastive learning showed the highest performance on the downstream tasks. Our findings suggest that learning from long-form videos can be helpful for action recognition with short videos.

Paper Structure (17 sections, 1 equation, 1 figure)


Figures (1)

  • Figure 1: Top-1 performance on the downstream tasks with MIM, contrastive learning, and teacher-student learning. Left: HMDB51; right: UCF101.