Self-Supervised Learning of Video-Induced Visual Invariances

Michael Tschannen, Josip Djolonga, Marvin Ritter, Aravindh Mahendran, Xiaohua Zhai, Neil Houlsby, Sylvain Gelly, Mario Lucic

TL;DR

The paper tackles the challenge of learning transferable visual representations with limited labeled data. It introduces VIVI, a self-supervised framework that exploits the implicit hierarchy of videos (frame-, shot-, and video-level invariances) to learn robust image representations from YouTube-8M videos, without requiring optical flow or tracking. By combining frame/shot-level losses with video-level prediction tasks, VIVI achieves state-of-the-art self-supervised transfer on the 19 VTAB tasks with only 1000 labels per task. When co-trained with labeled ImageNet data, it surpasses an ImageNet-pretrained ResNet-50 while using 10x fewer labeled images. The approach yields strong, data-efficient transfer performance and shows modest robustness gains to video perturbations, with future work aimed at a deeper understanding of robustness and task perturbations.

Abstract

We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI). We consider the implicit hierarchy present in the videos and make use of (i) frame-level invariances (e.g. stability to color and contrast perturbations), (ii) shot/clip-level invariances (e.g. robustness to changes in object orientation and lighting conditions), and (iii) video-level invariances (semantic relationships of scenes across shots/clips), to define a holistic self-supervised loss. Training models using different variants of the proposed framework on videos from the YouTube-8M (YT8M) data set, we obtain state-of-the-art self-supervised transfer learning results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB), using only 1000 labels per task. We then show how to co-train our models jointly with labeled images, outperforming an ImageNet-pretrained ResNet-50 by 0.8 points with 10x fewer labeled images, as well as the previous best supervised model by 3.7 points using the full ImageNet data set.

Paper Structure

This paper contains 35 sections, 4 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: (left) Illustration of the frame-, shot-, and video-level encoding pipeline used in this work. Each frame $x^i_{k,\ell}$ is encoded using the frame encoder $f$. The frame embeddings $f(x^i_{k,\ell})$ are then aggregated for each shot using a pooling function $p$ to obtain shot embeddings $e^i_k$. Predictions on the video level are then computed using the prediction functions $g_m$. (right) Intuitively, we want to choose frame/shot- and video-level losses that embed frames from the same shot close to each other and frames from different shots or videos far apart, while encouraging shot embeddings from the same video to be predictive of each other using (simple) prediction functions.
  • Figure 2: 1000-example mean score and per-category mean score of exemplar trained on frames (Ex-YT-F), of exemplar with additional shot-level self-supervision (Ex-YT-S), of the proposed method with InfoNCE video-level prediction across 4 shots (VIVI-Ex(4)), and of the same method with a 3$\times$ wider architecture (VIVI-Ex(4)-Big). Both shot- and video-level losses improve the overall score, with the gains coming mostly from higher mean accuracy on the natural and structured subsets.
  • Figure 3: Comparison of the 1000-example mean score of the proposed method with exemplar frame/shot-level and InfoNCE video-level prediction across 4 shots (VIVI-Ex(4), and VIVI-Ex(4)-Big with a 3$\times$ wider architecture), with ImageNet-based exemplar (Ex-ImageNet) and rotation (Rot-ImageNet) baselines, as well as the multi-task model from Doersch and Zisserman (2017). Our models outperform all baselines on average, and in particular on the structured data sets.
  • Figure 4: Per-data set comparison of our exemplar-based unsupervised model (VIVI-Ex(4)) and its counterpart co-trained with the full ImageNet data set (VIVI-Ex(4)-Co(100%)). The accuracy on most of the natural (red) and specialized (green) data sets improves, with the largest improvements observed on the latter, while the accuracy decreases for about half of the structured data sets (blue).
  • Figure 5: Per-data set comparison of ImageNet-based exemplar (Ex-ImageNet) with VIVI-Ex(4). Training on YT8M rather than ImageNet and exploiting temporal information mostly helps on natural (red) and structured (blue) data sets, and slightly hurts for some specialized (green) data sets.
  • ...and 4 more figures
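
The pipeline described in the Figure 1 caption (frame encoder $f$, pooling function $p$, video-level prediction across shots) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. Every name below is hypothetical, the frame encoder is a toy linear map with a ReLU rather than a ResNet, the pooling is assumed to be a mean, and the prediction function $g$ is assumed to be the identity, so each shot embedding directly predicts the next shot of the same video. Negatives for the InfoNCE-style loss are taken from the other videos in the batch, which is also an assumption.

```python
import math
import random


def frame_encoder(frame, weights):
    """Hypothetical stand-in for f: a linear map followed by a ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, frame))) for row in weights]


def pool_shot(frame_embeddings):
    """Assumed pooling function p: mean over the frame embeddings of one shot."""
    n = len(frame_embeddings)
    return [sum(col) / n for col in zip(*frame_embeddings)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(context, positive, negatives):
    """InfoNCE-style loss: negative log-softmax score of the positive."""
    logits = [dot(context, positive)] + [dot(context, n) for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[0]


def video_loss(videos, weights):
    """videos[i][k][l] is frame l of shot k of video i (a list of floats)."""
    shots = [
        [pool_shot([frame_encoder(fr, weights) for fr in shot]) for shot in video]
        for video in videos
    ]
    losses = []
    for i, video_shots in enumerate(shots):
        # Assumption: negatives are shot embeddings from the other videos.
        negatives = [e for j, other in enumerate(shots) if j != i for e in other]
        for k in range(len(video_shots) - 1):
            # g is the identity here: shot k predicts shot k + 1.
            losses.append(info_nce(video_shots[k], video_shots[k + 1], negatives))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    random.seed(0)
    dim_in, dim_out = 4, 3
    weights = [[random.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_out)]
    # Two toy videos, each with 3 shots of 2 frames.
    videos = [
        [[[random.gauss(0, 1) for _ in range(dim_in)] for _ in range(2)]
         for _ in range(3)]
        for _ in range(2)
    ]
    print(video_loss(videos, weights))
```

The loss is small when consecutive shot embeddings of the same video score higher against each other than against shots from other videos, which is the intuition given in the right-hand panel of Figure 1. A frame/shot-level loss would be added on top of this term in a full objective.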