End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Antoine Miech; Jean-Baptiste Alayrac; Lucas Smaira; Ivan Laptev; Josef Sivic; Andrew Zisserman

TL;DR

The paper addresses learning visual representations without manually annotated video data by training on the uncurated narrated instructional videos of HowTo100M. It introduces MIL-NCE, a Multiple Instance Learning extension of Noise Contrastive Estimation, which copes with misalignment between what is shown on screen and what is spoken, and uses it to learn a joint video-text embedding from scratch. The learned representations are evaluated on four downstream tasks over eight datasets: action recognition, text-to-video retrieval, action localization and action segmentation. The method outperforms all published self-supervised approaches on these tasks as well as several fully supervised baselines, and the joint embedding can be used off the shelf. The results indicate that end-to-end learning from uncurated narrated video is viable, with MIL-NCE providing a principled way to handle the noisy supervision.

Abstract

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
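To make the idea behind MIL-NCE more concrete, the following is a minimal NumPy sketch of a loss of this kind, not the authors' implementation. It assumes each clip comes with a small bag of candidate narrations (for example, the sentences closest in time), treats every narration in the bag as a possible positive, and uses the narrations of the other clips in the batch as negatives. The function name `mil_nce_loss`, the argument layout, and the restriction to clip-to-text negatives only are assumptions made for illustration; the paper's exact choice of negative set may differ.

```python
import numpy as np


def _masked_logsumexp(scores, mask):
    """Row-wise log(sum(exp(scores))) restricted to entries where mask is True."""
    masked = np.where(mask, scores, -np.inf)
    peak = masked.max(axis=1, keepdims=True)  # subtract the row maximum for stability
    return (peak + np.log(np.exp(masked - peak).sum(axis=1, keepdims=True)))[:, 0]


def mil_nce_loss(video_emb, text_emb, num_candidates):
    """Hypothetical MIL-NCE-style loss (illustrative sketch).

    video_emb: (B, D) array, one embedding per clip.
    text_emb:  (B * K, D) array, K candidate narrations per clip,
               stored contiguously so rows i*K .. i*K+K-1 belong to clip i.
    num_candidates: K, the bag size (an assumed layout, not from the paper).
    """
    batch = video_emb.shape[0]
    # scores[i, j] is the dot product between clip i and narration j.
    scores = video_emb @ text_emb.T
    # Narration j belongs to the bag of clip j // K.
    owner = np.arange(batch * num_candidates) // num_candidates
    pos_mask = owner[None, :] == np.arange(batch)[:, None]
    all_mask = np.ones_like(pos_mask, dtype=bool)
    # Numerator: sum over all candidate positives in the bag (the MIL part).
    log_pos = _masked_logsumexp(scores, pos_mask)
    # Denominator: positives plus negatives drawn from the other clips' bags.
    log_all = _masked_logsumexp(scores, all_mask)
    # Negative log of the ratio, averaged over clips.
    return float(np.mean(log_all - log_pos))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = rng.normal(size=(4, 16))
    narrations = rng.normal(size=(4 * 3, 16))
    print(mil_nce_loss(clips, narrations, num_candidates=3))
```

Because the numerator sums over the whole bag, a clip is not penalised when only some of its candidate narrations actually describe it, which is the property that makes this formulation tolerant of misaligned narration.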
