Online pre-training with long-form videos
Itsuki Kato, Kodai Kamiya, Toru Tamaki
TL;DR
This work addresses the challenge of pre-training for video understanding in an online setting by leveraging continuous long-form video clips. It compares three self-supervised learning approaches (masked image modeling, contrastive learning, and teacher-student knowledge distillation), using unlabelled AVA-Actions videos for pre-training and HMDB51/UCF101 for downstream evaluation. The key finding is that contrastive learning delivers the strongest downstream action-recognition performance, and that the choice of stride between sequential clips helps to mitigate overfitting. The results suggest that signals from long-form videos can transfer to short-video action recognition, and they provide guidance on clip-sampling strategies for online pre-training.
Abstract
In this study, we investigate the impact of online pre-training with continuous video clips. We examine three pre-training methods (masked image modeling, contrastive learning, and knowledge distillation) and assess their performance on downstream action recognition tasks. Among these, online pre-training with contrastive learning achieved the highest downstream performance. Our findings suggest that learning from long-form videos can be helpful for action recognition with short videos.
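As a rough illustration of what feeding continuous clips from a long-form video in temporal order could look like, the following minimal sketch enumerates frame indices for consecutive clips. The function name `sequential_clip_indices` and the parameters `clip_len` and `stride` are hypothetical and not taken from the paper; in particular, "stride" is assumed here to mean the frame offset between the starts of successive clips, which may differ from the authors' definition.

```python
from typing import Iterator, List


def sequential_clip_indices(
    num_frames: int, clip_len: int, stride: int
) -> Iterator[List[int]]:
    """Yield frame indices of consecutive clips, in temporal order.

    Assumption: `stride` is the offset (in frames) between the start of
    one clip and the start of the next. A stride smaller than `clip_len`
    gives overlapping clips; a larger stride skips frames between clips.
    """
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    start = 0
    # Stop as soon as a full clip no longer fits in the video.
    while start + clip_len <= num_frames:
        yield list(range(start, start + clip_len))
        start += stride


# Example: a 10-frame video, 4-frame clips, stride of 3 frames.
clips = list(sequential_clip_indices(num_frames=10, clip_len=4, stride=3))
```

Under this assumed definition, a larger stride makes successive clips less similar to one another, which is one plausible way that stride selection could reduce overfitting to near-duplicate consecutive clips.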
