Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach

Zechen Bai, Tianjun Xiao, Tong He, Pichao Wang, Zheng Zhang, Thomas Brox, Mike Zheng Shou

TL;DR

This work tackles the information gap between information-rich videos and their comparatively sparse textual descriptions in text-video retrieval by adopting a data-centric strategy. It enriches textual representations at training time with event-level video captions and at retrieval time with diverse LLM-generated queries, complemented by a query selection mechanism (Farthest Query Sampling) and the Oracle Query benchmark. The approach reports state-of-the-art results on MSR-VTT, MSVD, VATEX, and LSMDC, indicating that data-centric enrichment can improve cross-modal retrieval without overhauling model architectures. The findings highlight practical gains in retrieval accuracy and efficiency, and point to future work on integrating data-centric enrichment with foundation models and scalable query selection. $R$@$K$ improvements and ablations support the value of aligning textual cues with video content through structured, diverse queries.

Abstract

As online video content rapidly grows, the task of text-video retrieval (TVR) becomes increasingly important. A key challenge in TVR is the information asymmetry between video and text: videos are inherently richer in information, while their textual descriptions often capture only fragments of this complexity. This paper introduces a novel, data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. During training, videos are segmented into event-level clips and captioned to ensure comprehensive coverage. During retrieval, a large language model (LLM) generates semantically diverse queries to capture a broader range of possible matches. To enhance retrieval efficiency, we propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy. Our method achieves state-of-the-art results across multiple benchmarks, demonstrating the power of data-centric approaches in addressing information asymmetry in TVR. This work paves the way for new research focused on leveraging data to improve cross-modal retrieval.

Paper Structure (30 sections, 11 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: Videos contain much richer information than text. A single video can be described by many possible text queries, some of which are missing from the data.
  • Figure 2: Illustration of the unified text enrichment framework. The left part shows how the text representations of the training data are enriched via a comprehensive video captioning approach, which includes an event-aware video temporal segmentation module and a pre-trained captioner. The middle part shows the dual-encoder model with a text-conditioned video pooling design. The right part illustrates text enrichment during the retrieval phase, where a query generation module, a query selection module, and an aggregation module work together to enhance retrieval performance.
  • Figure 3: Illustration of the Farthest Query Sampling (FQS) algorithm. The queries are distributed within a certain range of relevance. The blue point (user query) is set as the root query. At each step, FQS samples the query that is farthest from all previously sampled queries.
  • Figure 4: Prompts used for text enrichment in retrieval.
  • Figure 5: Rephrasing concepts. The original query focuses on the concept 'team'. Although 'team' may imply the involvement of 'multiple players', this is not explicitly stated. The enriched query states it explicitly, which corrects the retrieval result.
  • ...and 4 more figures
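
The Figure 3 caption describes FQS only at a high level: start from the user query as the root, then repeatedly pick the query farthest from those already sampled. A minimal sketch of that selection loop is given below. It is not the authors' implementation. It assumes that queries are represented by non-zero embedding vectors, that distance is cosine distance, and that "farthest from all previously sampled queries" means maximizing the minimum distance to the selected set (the usual farthest-point sampling rule). The function name `farthest_query_sampling` and its parameters are hypothetical.

```python
import numpy as np


def farthest_query_sampling(embeddings, k, root=0):
    """Greedy farthest-point sampling over query embeddings.

    Assumptions (not taken from the paper): cosine distance, and the
    "farthest" query is the one whose distance to its nearest
    already-selected query is largest.

    embeddings: (n, d) array-like of non-zero query embeddings.
    k:          number of queries to keep (including the root).
    root:       index of the user query, which is always selected first.
    Returns a list of selected indices in selection order.
    """
    emb = np.asarray(embeddings, dtype=float)
    # L2-normalise rows so that a dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    n = emb.shape[0]
    k = min(k, n)

    selected = [root]
    # min_dist[i]: cosine distance from query i to its nearest selected query.
    min_dist = 1.0 - emb @ emb[root]
    min_dist[root] = -np.inf  # never re-select an already chosen query

    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update nearest-selected distances with the newly chosen query.
        min_dist = np.minimum(min_dist, 1.0 - emb @ emb[nxt])
        min_dist[nxt] = -np.inf
    return selected


if __name__ == "__main__":
    # Toy 2-D embeddings: index 0 is the root (user) query,
    # index 1 is a near-duplicate of it.
    toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(farthest_query_sampling(toy, k=3))
```

In the toy example the near-duplicate of the root query (index 1) is the last candidate to be picked, which matches the caption's intent of keeping a small but diverse set of generated queries. In practice the embeddings would come from the retrieval model's text encoder, and the selected queries would then be passed to the aggregation module shown in Figure 2.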