Self-supervised video pretraining yields robust and more human-aligned visual representations

Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff

TL;DR

The paper investigates whether video pretraining yields visual representations that generalize across tasks, are robust to perturbations, and align with human judgments. It introduces VideoNet, a data-curation pipeline to align video distributions with ImageNet, and VITO, a self-supervised contrastive framework with multi-scale attention for distilling video transformations into image representations. Empirically, VITO delivers strong task-general performance, surpasses prior video pretraining on scene understanding, and remains robust under distribution shifts and synthetic deformations, while its predictions align with human perceptual judgments. The results suggest that video pretraining can serve as a simple, effective approach to learning robust, human-aligned, and general visual representations.

Abstract

Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding tasks. Moreover, VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones. Finally, VITO's predictions are strongly aligned with human judgements, surpassing models that were specifically trained for that purpose. Together, these results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.

Paper Structure (27 sections, 3 equations, 7 figures, 9 tables)


Figures (7)

  • Figure 1: Learning to attend to related video content. Each augmented frame is encoded by the network $f$ as a spatial array of hidden vectors. The attention module $a$ takes as input features from one view and produces a mask that isolates features that are likely to be predictive of the other, temporally-displaced view. The attention-gated features are pooled accordingly, and both the feature extractor and attention module are trained to satisfy the contrastive objective. Subscripts $\theta$ and $\xi$ refer to online and target (EMA) networks respectively.
  • Figure 2: ImageNet-3DCC validation accuracy for different levels of corruption severity. (Left): comparisons with prior work, including methods specifically designed to enhance robustness (SIN+IN1K and L2-Robust). (Right): comparisons with ablations of the VITO method/model.
  • Figure 3: Example saliency maps from humans (ClickMe dataset; Linsley et al., 2018) and from ResNet-50 models. Gradient-based saliency is shown for the Supervised and Harmonized (Fel et al., 2022) models. Attention maps are shown for the CLIP and VITO models. We use multi-head attention pool weights for CLIP and the average of the weights from the last 2 attention pooling scales in VITO.
  • Figure 4: Impact of pretraining data's spatial content on representation quality. Left: transfer performance of models pretrained on single frames from image datasets (grey bars) or individual videos (blue bars). Right: example frames from different video and image datasets.
  • Figure B.1: Example augmented frames with overlaid (resized) learned attention masks. Attention is computed from the output of the final block of the VITO-trained ResNet-50. Crucially, the attention masks are computed independently, such that the attention module can only use spatial cues.
  • ...and 2 more figures
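
The attention-gated pooling and the online/target (EMA) relationship described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the attention module is assumed here to be a single linear map followed by a softmax over spatial positions, and the names `attention_pool`, `ema_update`, `w_att`, and `decay` are hypothetical. The contrastive objective itself is omitted.

```python
import numpy as np


def attention_pool(features, w_att, b_att=0.0):
    """Pool an (H, W, D) feature map into a D-vector with a spatial mask.

    Assumption: the attention module is one linear map per position followed
    by a softmax over all H*W positions. The mask depends on this view only.
    """
    h, w, d = features.shape
    flat = features.reshape(h * w, d)      # (N, D) hidden vectors
    logits = flat @ w_att + b_att          # (N,) one score per position
    logits = logits - logits.max()         # shift for numerical stability
    mask = np.exp(logits)
    mask = mask / mask.sum()               # mask sums to 1 over positions
    pooled = mask @ flat                   # (D,) attention-weighted average
    return pooled, mask.reshape(h, w)


def ema_update(target, online, decay=0.99):
    """Move target parameters (xi) toward online parameters (theta)."""
    return {k: decay * target[k] + (1.0 - decay) * online[k] for k in online}


# Usage with random stand-in features for one augmented frame.
rng = np.random.default_rng(0)
feats = rng.normal(size=(7, 7, 16))
pooled, mask = attention_pool(feats, rng.normal(size=16))
print(pooled.shape, mask.shape)
```

With all-zero attention weights the mask is uniform, so the function reduces to ordinary global average pooling; a trained attention module would instead concentrate the mask on content likely to persist in the temporally displaced view.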