Localizing Moments in Video with Natural Language

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

TL;DR

The paper addresses localizing moments in video from natural language queries by introducing the Moment Context Network (MCN), which fuses local moment features with global video context and temporal endpoint features. To train and evaluate moment localization in open-world, unedited videos, the authors collect the DiDeMo dataset: over 10,000 personal videos with over 40,000 pairs of localized moments and referring expressions, validated to ensure the descriptions are referential. MCN outperforms several baseline methods at retrieval, and ablations show the value of global context, temporal endpoint features, and multi-modal (RGB and optical flow) inputs. The work provides a benchmark and a method for language-guided moment localization in real-world videos, with potential applications in personal video library search and stock footage retrieval.

Abstract

We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global video features over time. A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment. Therefore, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions. We demonstrate that MCN outperforms several baseline methods and believe that our initial results together with the release of DiDeMo will inspire further research on localizing video moments with natural language.

Paper Structure

This paper contains 18 sections, 5 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: We consider localizing moments in video with natural language and demonstrate that incorporating local and global video features is important for this task. To train and evaluate our model, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 40,000 pairs of localized video moments and corresponding natural language.
  • Figure 2: Our Moment Context Network (MCN) learns a shared embedding for video temporal context features and LSTM language features. Our video temporal context features integrate local video features, which reflect what occurs during a specific moment, global features, which provide context for the specific moment, and temporal endpoint features which indicate when a moment occurs in a video. We consider both appearance and optical flow input modalities, but for simplicity only show the appearance input modality here.
  • Figure 3: Example videos and annotations from our Distinct Describable Moments (DiDeMo) dataset. Annotators describe moments with varied language (e.g., "A cat walks over two boxes" and "An orange cat walks out of a box"). Videos with multiple events (top) have annotations which span all five-second segments. Other videos have segments in which no distinct event takes place (e.g., the end of the bottom video in which no cats are moving).
  • Figure 4: Natural language moment retrieval results on DiDeMo. Ground truth moments are outlined in yellow. The Moment Context Network (MCN) localizes diverse descriptions which include temporal indicators, such as "first" (top), and camera words, such as "camera zooms" (middle).
  • Figure 5: MCN correctly retrieves two different moments (light green rectangle on left and light blue rectangle on right). Though our ground truth annotations are five-second segments, we can evaluate with more fine-grained temporal proposals at test time. This gives a better understanding of when moments occur in video (e.g., "A ball flies over the athletes" occurs at the start of the first temporal segment).
  • ...and 8 more figures
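
The Figure 2 caption describes video temporal context features that combine local features, global features, and temporal endpoint features for a candidate moment. A minimal sketch of that combination follows, under several assumptions not stated on this page: the function names are hypothetical, segment features are mean-pooled, endpoints are normalized by the number of segments, and the example video is split into six five-second segments. It illustrates the idea only and is not the authors' implementation.

```python
def candidate_moments(num_segments):
    """Enumerate all contiguous spans of segments as (start, end), inclusive."""
    return [(s, e) for s in range(num_segments) for e in range(s, num_segments)]


def mean_pool(features):
    """Average a non-empty list of equal-length feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def temporal_context_feature(segment_features, start, end):
    """Concatenate local, global, and temporal endpoint features for one moment.

    segment_features: one feature vector per five-second segment.
    start, end: inclusive segment indices of the candidate moment.
    """
    num_segments = len(segment_features)
    # Local: what occurs during the candidate moment (pooling is an assumption).
    local = mean_pool(segment_features[start:end + 1])
    # Global: context pooled over the whole video.
    global_context = mean_pool(segment_features)
    # Temporal endpoints: when the moment occurs, scaled to [0, 1] (assumption).
    endpoints = [start / num_segments, (end + 1) / num_segments]
    return local + global_context + endpoints


# Toy example: six segments (an assumed video length), two-dimensional features.
segment_features = [[float(i), 1.0] for i in range(6)]
moments = candidate_moments(len(segment_features))
feature = temporal_context_feature(segment_features, 1, 2)
print(len(moments))  # 21 candidate moments for six segments
print(feature)
```

At retrieval time, each candidate moment's feature would be embedded and compared against the embedded language query, and the closest candidate returned. The embedding networks themselves are omitted here.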