End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman

TL;DR

The paper learns visual representations without manually annotated video data by training on uncurated narrated instructional videos from HowTo100M. It introduces MIL-NCE, a Multiple Instance Learning extension of Noise Contrastive Estimation, which handles misalignment between video content and spoken narration and learns a joint video-text embedding from scratch. Across four downstream tasks on eight datasets (action recognition, text-to-video retrieval, action localization and action segmentation), the approach outperforms all published self-supervised methods and several fully supervised baselines, and its joint representation can be used off the shelf. The results show that end-to-end learning from uncurated data is viable, with MIL-NCE providing a principled way to cope with noisy supervision.

Abstract

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.

Paper Structure

This paper contains 15 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: We describe an efficient approach to learn visual representations from misaligned and noisy narrations (bottom) automatically extracted from instructional videos (top). Our video representations are learnt from scratch without relying on any manually annotated visual dataset yet outperform all self-supervised and many fully-supervised methods on several video recognition benchmarks.
  • Figure 2: Left. Our MIL-NCE makes it possible to consider a set of multiple positive candidate pairs $\{(x, y), (x, y^1), \dots, (x, y^4)\}$ while the standard NCE approach would only consider the single $(x, y)$ training pair and miss the visually grounded object description 'sander' from pair $(x, y^3)$ or the action description 'sanding down' from $(x, y^4)$. Right. Given a video $x$ and an associated set of positive narration candidates $\mathcal{P}$ (green triangles) that may or may not be correct, our MIL-NCE selects multiple correct positives (large blue areas) while downweighting incorrect positives (smaller blue areas) based on a discriminative ratio against negatives $\mathcal{N}$ (red squares). In contrast, traditional MIL considers only one positive (orange circle) while discarding the rest.
  • Figure 3: Selected video and narration pairs from five positive candidates on HowTo100M held-out samples using MIL-NCE.
  • Figure 4: Model architecture. In this figure, we provide a visualization of the video embedding network $f$ (left) and the text embedding network $g$ (right). Modules displayed in blue are trained from scratch on the challenging uncurated HowTo100M dataset using the MIL-NCE loss. The word embeddings are learned in an unsupervised fashion using Word2Vec trained on GoogleNews and are kept fixed during training. Finally, the dimensions of the outputs of each layer are specified in brackets, e.g. the output of the 'Word2Vec' layer is of size $[16, 300]$ corresponding to the $16$ word embedding vectors of dimension $300$ (one vector for each word, also materialized by the 16 grey rectangles).
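The Figure 2 caption describes MIL-NCE as a discriminative ratio of several positive candidates against a set of negatives. The sketch below is one way to read that description for a single video clip, not the authors' implementation. It assumes dot-product similarity between embeddings and, as a simplification, uses only negative narrations (no negative videos). The names `mil_nce_loss` and `positive_shares` are hypothetical and chosen for illustration.

```python
import math


def dot(u, v):
    """Dot-product similarity between two embedding vectors (assumed scoring function)."""
    return sum(a * b for a, b in zip(u, v))


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mil_nce_loss(video_emb, positive_text_embs, negative_text_embs):
    """Loss for one clip: -log( sum_P exp(s) / (sum_P exp(s) + sum_N exp(s)) ).

    P is the set of positive narration candidates, N the set of negatives
    (simplified here to negative narrations only).
    """
    pos_scores = [dot(video_emb, y) for y in positive_text_embs]
    neg_scores = [dot(video_emb, y) for y in negative_text_embs]
    log_numerator = logsumexp(pos_scores)
    log_denominator = logsumexp(pos_scores + neg_scores)
    return log_denominator - log_numerator


def positive_shares(video_emb, positive_text_embs):
    """Fraction of the numerator contributed by each positive candidate.

    Illustrative only: candidates with low similarity get a small share,
    which loosely mirrors the "downweighting" mentioned in the caption.
    """
    scores = [dot(video_emb, y) for y in positive_text_embs]
    total = logsumexp(scores)
    return [math.exp(s - total) for s in scores]


if __name__ == "__main__":
    x = [1.0, 0.0]                               # toy video embedding
    positives = [[1.0, 0.0], [2.0, 0.0], [-3.0, 0.0]]  # last one is misaligned
    negatives = [[0.0, 1.0], [-1.0, 0.5]]
    print("loss:", mil_nce_loss(x, positives, negatives))
    print("shares:", positive_shares(x, positives))
```

With a single positive candidate the function reduces to a standard softmax-style NCE loss for the pair $(x, y)$. Adding further well-aligned candidates increases the numerator and lowers the loss, while a poorly aligned candidate contributes almost nothing.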