HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

TL;DR

The paper addresses learning text-video embeddings without manually annotated captions by using narrated instructional videos at scale. It introduces HowTo100M, a dataset of 136 million clips sourced from 1.22 million narrated videos, and trains a joint text-video embedding with a two-layer nonlinear model that uses context gating, optimized with a max-margin ranking loss and intra-video negative sampling. The resulting embedding achieves state-of-the-art results on the instructional benchmarks YouCook2 and CrossTask, and after fine-tuning it transfers well to MSR-VTT (generic YouTube videos) and LSMDC (movies), outperforming models trained on those datasets alone. The results indicate that both the scale of the pre-training data and fine-tuning on the target domain matter. The dataset, code, and models are to be made publicly available.
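
As a rough illustration of the model described above, the following NumPy sketch shows one way a two-layer embedding with context gating and a max-margin ranking loss could fit together. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the feature dimensions, the random weights, the margin of 0.1, and the choice to treat every other pair in the batch as a negative are all hypothetical. In particular, how negatives are sampled from within the same video is left to however the batch is constructed.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_embedding(x, W1, b1, W2, b2):
    """Two-layer embedding with context gating, followed by L2 normalization.

    Assumed form: h = x W1 + b1, output = h * sigmoid(h W2 + b2).
    """
    h = x @ W1 + b1                      # first linear layer
    gated = h * sigmoid(h @ W2 + b2)     # second layer acts as an element-wise gate
    norm = np.linalg.norm(gated, axis=1, keepdims=True)
    return gated / np.maximum(norm, 1e-12)


def max_margin_ranking_loss(sim, margin=0.1):
    """sim[i, j] is the similarity of clip i and caption j.

    Diagonal entries are the matching pairs; every off-diagonal entry is
    treated as a negative (an assumption of this sketch).
    """
    pos = np.diag(sim)
    # Clip i should score its own caption above any other caption j.
    cost_caption = np.maximum(0.0, margin + sim - pos[:, None])
    # Caption j should score its own clip above any other clip i.
    cost_clip = np.maximum(0.0, margin + sim - pos[None, :])
    off_diag = ~np.eye(sim.shape[0], dtype=bool)
    total = cost_caption[off_diag].sum() + cost_clip[off_diag].sum()
    return float(total / sim.shape[0])


def random_params(rng, d_in, d_out):
    """Hypothetical helper: random weights standing in for trained ones."""
    return (rng.normal(size=(d_in, d_out)), np.zeros(d_out),
            rng.normal(size=(d_out, d_out)), np.zeros(d_out))


rng = np.random.default_rng(0)
batch, d_video, d_text, d_embed = 5, 16, 8, 4   # illustrative sizes only

video_emb = gated_embedding(rng.normal(size=(batch, d_video)),
                            *random_params(rng, d_video, d_embed))
text_emb = gated_embedding(rng.normal(size=(batch, d_text)),
                           *random_params(rng, d_text, d_embed))

sim = video_emb @ text_emb.T             # dot product of unit-norm vectors
loss = max_margin_ranking_loss(sim)
```

Because both embeddings are L2-normalized, the dot product is a cosine similarity, so the margin is expressed on a fixed scale between -1 and 1.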

Abstract

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are threefold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.

Paper Structure

This paper contains 20 sections, 6 equations, 9 figures, and 9 tables.

Figures (9)

  • Figure 1: We learn a joint text-video embedding by watching millions of narrated video clips of people performing diverse visual tasks. The learned embedding transfers well to other instructional and non-instructional text-video datasets.
  • Figure 2: Examples of clip-caption pairs retrieved with the help of our joint embedding. Pairs are selected based on the similarity between visual appearance and corresponding narration, while they are arranged based on linguistic similarity across pairs. Examples are taken from 4 distinct clusters, corresponding to Knitting, Woodwork/Measuring, Cooking/Seasoning and Electric maintenance.
  • Figure 3: Retrieval and step localization results when varying the training size of our HowTo100M dataset.
  • Figure 4: Evaluation of fine-tuning a HowTo100M pre-trained model with varying amounts of MSR-VTT supervision for text-to-video clip retrieval.
  • Figure 5: Results of clip retrieval by pre-training models on different datasets. Evaluation on LSMDC, YouCook2 and MSR-VTT.
  • ...and 4 more figures