Finding Moments in Video Collections Using Natural Language

Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, Bryan Russell

TL;DR

The paper tackles the problem of retrieving relevant moments from large collections of untrimmed videos using natural language queries. It introduces SpatioTemporal Alignment with Language (STAL), which represents a moment as a set of regions across short video clips and aligns the query to these regions with a Chamfer-based cost, enabling efficient indexing and a two-stage retrieval pipeline with re-ranking. On extended versions of the DiDeMo and Charades-STA datasets, the approach improves over the prior Moment Context Network baseline and, in an approximate setting with a 1M video corpus, achieves faster retrieval and a smaller index. By combining clip- and object-level features and employing InfoNCE training, STAL reduces moment-frequency biases and provides a framework for corpus-scale video grounding with natural language inputs.

Abstract

We introduce the task of retrieving relevant video moments from a large corpus of untrimmed, unsegmented videos given a natural language query. Our task poses unique challenges as a system must efficiently identify both the relevant videos and localize the relevant moments in the videos. To address these challenges, we propose SpatioTemporal Alignment with Language (STAL), a model that represents a video moment as a set of regions within a series of short video clips and aligns a natural language query to the moment's regions. Our alignment cost compares variable-length language and video features using symmetric squared Chamfer distance, which allows for efficient indexing and retrieval of the video moments. Moreover, aligning language features to regions within a video moment allows for finer alignment compared to methods that extract only an aggregate feature from the entire video moment. We evaluate our approach on two recently proposed datasets for temporal localization of moments in video with natural language (DiDeMo and Charades-STA) extended to our video corpus moment retrieval setting. We show that our STAL re-ranking model outperforms the recently proposed Moment Context Network on all criteria across all datasets on our proposed task, obtaining relative gains of 37% - 118% for average recall and up to 30% for median rank. Moreover, our approach achieves more than 130x faster retrieval and 8x smaller index size with a 1M video corpus in an approximate setting.
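
The alignment cost described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the textbook form of the symmetric squared Chamfer distance (mean nearest-neighbor squared Euclidean distance, summed over both directions), and the paper's exact normalization may differ. The names `sq_dist` and `chamfer_sq` and the toy feature vectors are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chamfer_sq(query_feats, moment_feats):
    """Symmetric squared Chamfer distance between two sets of vectors.

    Assumption: each direction is averaged over its own set, so the cost
    stays comparable when the query or the moment has more elements.
    """
    if not query_feats or not moment_feats:
        raise ValueError("both feature sets must be non-empty")
    # Each language feature is matched to its closest moment region.
    q_to_m = sum(
        min(sq_dist(q, m) for m in moment_feats) for q in query_feats
    ) / len(query_feats)
    # Each moment region is matched to its closest language feature.
    m_to_q = sum(
        min(sq_dist(m, q) for q in query_feats) for m in moment_feats
    ) / len(moment_feats)
    return q_to_m + m_to_q


# Toy example: two word features and three region features in 2-D.
words = [[0.0, 0.0], [1.0, 0.0]]
regions = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
cost = chamfer_sq(words, regions)
```

Because each direction takes a minimum over the other set, the two sets may differ in size, which is what lets a variable-length query be compared with a variable-length moment.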

Paper Structure

This paper contains 35 sections, 12 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Problem statement and approach overview. (a) Given a natural language query, we seek to find relevant videos from a large corpus of untrimmed, unsegmented videos and temporally localize relevant moments within the returned videos. (b) Our approach aligns natural language queries to regions that compose the candidate moment.
  • Figure 2: Our SpatioTemporal Alignment with Language (STAL) model (a) at a glance and (b) with additional details of the spatiotemporal alignment. Our model is a neural network comprising two branches that align clip features $f^{(v)}_{\text{Clips}}$ and object features $f^{(v)}_{\text{Obj}}$ extracted from the video $v$ to the corresponding word representations $f^{(q)}_{\text{Clips}}$ and $f^{(q)}_{\text{Obj}}$ of language query $q$ via the evaluation of the alignment cost (\ref{eqn:cost_alignment_chamfer}). TEF refers to temporal endpoint features that encode the position of the clip in the video. See text for details.
  • Figure 3: System for indexing and retrieval. (a) Our approach allows for efficient storage and retrieval of variable-length moments in a video collection database via a two-stage approach. The first stage, efficient retrieval, uses a lightweight version of our model, termed (b) "STAL (clips)". Clip features $f^{(v)}_{\text{Clips}}$ are matched to the query vector $f^{(q)}$ by means of Euclidean distance. We train the "STAL (clips)" embedding with the cost (\ref{eqn:cost_alignment_euclidean}). The second stage, re-ranking, ranks all possible moments of different lengths containing the top retrieved clips from the first stage, using the cost (\ref{eqn:cost_alignment_chamfer}). See text for details.
  • Figure 4: Video corpus retrieval qualitative results. We show top temporally localized moment retrievals for different natural language queries across all videos in DiDeMo \cite{hendricks2017localizing} and Charades-STA \cite{gao2017tall}. Ground truth annotations appear as a green line below a video, best viewed in color. Refer to the appendix for videos and more results.
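
The two-stage system in the Figure 3 caption can be sketched as follows. This is only an illustration under several assumptions, not the authors' system: a brute-force scan stands in for the efficient (approximate) index, the moment is represented by its clip features alone (no object features or TEF), the Chamfer cost uses the textbook mean-of-minima form, and `retrieve`, `top_k`, and `max_len` are hypothetical names.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chamfer_sq(xs, ys):
    """Symmetric squared Chamfer distance (assumed mean-of-minima form)."""
    x_to_y = sum(min(sq_dist(x, y) for y in ys) for x in xs) / len(xs)
    y_to_x = sum(min(sq_dist(y, x) for x in xs) for y in ys) / len(ys)
    return x_to_y + y_to_x


def retrieve(query_vec, query_words, corpus, top_k=2, max_len=2):
    """Return (cost, video_id, start, end) tuples sorted by ascending cost.

    corpus maps a video id to its list of clip feature vectors;
    start and end are inclusive clip indices.
    """
    # Stage 1 (retrieval): rank every clip in the corpus by Euclidean
    # distance to a single query vector and keep the top_k clips.
    clip_hits = sorted(
        (sq_dist(query_vec, feat), vid, i)
        for vid, clips in corpus.items()
        for i, feat in enumerate(clips)
    )[:top_k]

    # Stage 2 (re-ranking): enumerate all moments of up to max_len clips
    # that contain a retrieved clip, then score them with the Chamfer cost.
    candidates = set()
    for _, vid, i in clip_hits:
        n = len(corpus[vid])
        for start in range(max(0, i - max_len + 1), i + 1):
            for end in range(i, min(n, start + max_len)):
                candidates.add((vid, start, end))
    return sorted(
        (chamfer_sq(query_words, corpus[vid][s:e + 1]), vid, s, e)
        for vid, s, e in candidates
    )


# Toy corpus: two videos with 2-D clip features.
corpus = {
    "a": [[0.0, 0.0], [5.0, 5.0], [9.0, 9.0]],
    "b": [[1.0, 0.0], [0.0, 1.0]],
}
results = retrieve([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], corpus)
best_cost, best_video, best_start, best_end = results[0]
```

The first stage touches only fixed-size clip vectors, which is what makes it indexable; the more expensive set-to-set cost is evaluated only on the few moments surrounding the retrieved clips.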