A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

Matthew Gwilliam, Michael Cogswell, Meng Ye, Karan Sikka, Abhinav Shrivastava, Ajay Divakaran

TL;DR

This paper addresses a gap in long video retrieval benchmarks by introducing the 10k Words benchmark, which uses LLM-generated diverse captions to capture the wide range of valid descriptions for long videos. It presents a scalable pipeline to generate, analyze, and fuse diverse captions across three axes (duration, summarization, and simplification), creating ActivityNet10k, QuerYD10k, and LF-VILA10k. Empirically, SOTA video-language models struggle with short captions, but combining 10k-caption training data with inference-time caption ensembles yields gains in $R@1$ in both zero-shot and finetuned settings, including up to +3.4% $R@1$ in the zero-shot setting on ActivityNet. The work also provides automatic and human analyses of data fidelity, an analysis of error modes for short captions, and practical notes on prompts, costs, and ablations, highlighting the value of synthetic diverse captions for advancing long-video retrieval.

Abstract

Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could range anywhere from moment-by-moment detail to a single phrase summary. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We use synthetic captions from this pipeline to perform a benchmark of a representative set of video language models using long video datasets, and show that the models struggle on shorter captions. We show that finetuning on this data can both mitigate these issues (+2.8% R@1 over SOTA on ActivityNet with diverse captions), and even improve performance on standard paragraph-to-video retrieval (+1.0% R@1 on ActivityNet). We also use synthetic data from our pipeline as query expansion in the zero-shot setting (+3.4% R@1 on ActivityNet). We derive insights by analyzing failure cases for retrieval with short captions. For data access and other details, please refer to our project website at https://mgwillia.github.io/10k-words.
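
The gains above are reported in R@1, the fraction of text queries for which the correct video is ranked first. As a minimal sketch of how such a recall metric can be computed, assume a similarity matrix in which entry `[i][j]` scores query `i` against video `j`, and in which the correct video for query `i` is video `i`. This layout, the function name `recall_at_k`, and the toy scores are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose correct video is ranked within the top k.

    Assumption (hypothetical layout): similarity[i][j] scores query i
    against video j, and the correct video for query i is video i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: how many videos score strictly higher
        # than the correct one for this query.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy scores for three queries and three videos (made-up values).
toy_scores = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.3, 0.2, 0.8],  # query 1: correct video ranked last
    [0.1, 0.4, 0.7],  # query 2: correct video ranked first
]
r_at_1 = recall_at_k(toy_scores, k=1)
r_at_3 = recall_at_k(toy_scores, k=3)
```

In this toy case two of the three queries retrieve their video at rank one, so `r_at_1` is 2/3, while `r_at_3` is 1.0 because every correct video falls within the top three.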

Paper Structure (21 sections, 8 figures, 16 tables)

This paper contains 21 sections, 8 figures, 16 tables.

Figures (8)

  • Figure 1: In real-world text-to-video retrieval, users could use diverse queries. Standard long video datasets use only paragraph-style captions ("Existing", "Full paragraph"), which does not allow for training or evaluation on a representative set of long video descriptions. Practical applications also require the ability to handle complex, short, and partial descriptions of a long video. In this work, we introduce an approach to generate, evaluate, and train on such diverse video description data.
  • Figure 2: We plot standard caption retrieval results for each item in ActivityNet, sorted by rank. We also plot the retrieval for a few synthetic caption types, sorted by standard caption retrieval rank. For many samples, synthetic captions yield superior retrievals.
  • Figure 3: We measure the length and retrieval uniqueness for short caption retrieval, and find that the highest ranks correlate with captions that have lost their unique information.
  • Figure 4: We measure uniqueness and plausibility for short captions with bad retrievals. We find that most difficult samples tend to be non-unique and have many plausible correct retrievals.
  • Figure 5: We perform contrastive finetuning for retrieval with video-caption pairs. We propose efficient sampling of our 10k text captions for data augmentation, where we compute standard contrastive loss, but each caption is sampled randomly from the 10k captions for a given video, according to a mixing ratio, $\eta$.
  • ...and 3 more figures
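
The caption sampling described for Figure 5 can be sketched as follows. This is a minimal illustration under one stated assumption: the mixing ratio $\eta$ is read here as the probability of drawing a synthetic 10k caption for a video instead of its original paragraph. The exact definition of $\eta$ is not given in this summary, and the names `sample_caption` and `build_batch`, as well as the dictionary keys, are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def sample_caption(original: str, synthetic: List[str], eta: float,
                   rng: random.Random) -> str:
    """Pick one caption for a video.

    Assumption: with probability eta, return a synthetic caption chosen
    uniformly at random; otherwise return the original paragraph.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    if synthetic and rng.random() < eta:
        return rng.choice(synthetic)
    return original


def build_batch(videos: List[Dict], eta: float,
                seed: int = 0) -> List[Tuple[str, str]]:
    """Return (video_id, caption) pairs, one sampled caption per video.

    The pairs would then feed a standard contrastive loss, unchanged.
    """
    rng = random.Random(seed)
    return [
        (v["video_id"], sample_caption(v["original"], v["synthetic"], eta, rng))
        for v in videos
    ]


# Made-up example data: two videos, each with a paragraph and short variants.
videos = [
    {"video_id": "v1",
     "original": "A man assembles a bike, then rides it down a hill.",
     "synthetic": ["A man builds and rides a bike.", "Bike assembly."]},
    {"video_id": "v2",
     "original": "A chef chops onions, fries them, and plates a dish.",
     "synthetic": ["A chef cooks onions.", "Cooking."]},
]
batch_original_only = build_batch(videos, eta=0.0)
batch_synthetic_only = build_batch(videos, eta=1.0)
batch_mixed = build_batch(videos, eta=0.5, seed=42)
```

Because only the text side of each pair changes, a sampler of this kind adds one caption per video per step and leaves the batch size and the loss computation as they were.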