Reasoning Text-to-Video Retrieval via Digital Twin Video Representations and Large Language Models

Yiqing Shen, Chenxiao Fan, Chenjia Li, Mathias Unberath

TL;DR

This work defines reasoning text-to-video retrieval, which targets implicit queries that require multi-step reasoning and object-level grounding within videos. It introduces a two-stage approach: videos are first converted into digital twin representations and matched to decomposed sub-queries through compositional alignment to retrieve candidates; an LLM then reasons over the candidates, using just-in-time refinement to fill information gaps before grounding the query conditions. The authors release ReasonT2VBench-135 and ReasonT2VBench-1000 to quantify performance on implicit queries, reporting 81.2% R@1 on ReasonT2VBench-135 (more than 50 percentage points above the strongest baseline) and state-of-the-art results on conventional T2V benchmarks. The approach preserves the fine-grained spatial details needed for grounding and offers a scalable framework for reasoning-driven video retrieval, with potential applications in search, recommendation, and context-aware querying.

Abstract

The goal of text-to-video retrieval is to search large databases for relevant videos based on text queries. Existing methods have progressed to handling explicit queries, in which the visual content of interest is described directly; however, they fail with implicit queries, where identifying the relevant videos requires reasoning. We introduce reasoning text-to-video retrieval, a paradigm that extends traditional retrieval to process implicit queries through reasoning while providing object-level grounding masks that identify which entities satisfy the query conditions. Instead of relying on vision-language models directly, we propose representing video content as digital twins, i.e., structured scene representations that decompose salient objects through specialist vision models. This approach is beneficial because it enables large language models to reason directly over long-horizon video content without visual token compression. Specifically, our two-stage framework first performs compositional alignment between decomposed sub-queries and digital twin representations for candidate identification, then applies large language model-based reasoning with just-in-time refinement that invokes additional specialist models to address information gaps. We construct a benchmark of 447 manually created implicit queries over 135 videos (ReasonT2VBench-135) and a more challenging version with 1000 videos (ReasonT2VBench-1000). Our method achieves 81.2% R@1 on ReasonT2VBench-135, outperforming the strongest baseline by more than 50 percentage points, and maintains 81.7% R@1 on the extended configuration while establishing state-of-the-art results on three conventional benchmarks (MSR-VTT, MSVD, and VATEX).
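
To make the two-stage control flow described in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the `DigitalTwin` and `TwinObject` records, the exact-match `alignment_score`, the rule-based `condition` that stands in for the LLM, and the lookup-table `specialist` that stands in for an extra vision model. Grounding is reduced to returning an object label, whereas the paper returns segmentation masks.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TwinObject:
    """One salient object in a hypothetical digital twin record."""
    label: str
    attributes: set[str] = field(default_factory=set)


@dataclass
class DigitalTwin:
    """Hypothetical structured scene representation of one video."""
    video_id: str
    objects: list[TwinObject]


def alignment_score(sub_queries: list[str], twin: DigitalTwin) -> float:
    """Stage 1 stand-in: fraction of sub-queries matched by any label or attribute."""
    if not sub_queries:
        return 0.0
    vocab: set[str] = set()
    for obj in twin.objects:
        vocab.add(obj.label)
        vocab |= obj.attributes
    return sum(q in vocab for q in sub_queries) / len(sub_queries)


def retrieve_candidates(sub_queries: list[str], twins: list[DigitalTwin],
                        top_k: int = 2) -> list[DigitalTwin]:
    """Rank all videos by alignment score and keep the top_k as candidates."""
    ranked = sorted(twins, key=lambda t: alignment_score(sub_queries, t), reverse=True)
    return ranked[:top_k]


def reason_and_ground(candidates: list[DigitalTwin],
                      condition: Callable[[TwinObject], bool],
                      specialist: Callable[[str, str], set[str]]
                      ) -> Optional[tuple[str, str]]:
    """Stage 2 stand-in: check the query condition on each object; when it
    fails, ask a specialist for more attributes (just-in-time refinement)
    and check again. Returns (video_id, object_label) or None."""
    for twin in candidates:
        for obj in twin.objects:
            if not condition(obj):
                obj.attributes |= specialist(twin.video_id, obj.label)
            if condition(obj):
                return twin.video_id, obj.label
    return None


# Toy database of three videos (all contents are made up).
twins = [
    DigitalTwin("v1", [TwinObject("dog", {"animal"}), TwinObject("ball", {"toy"})]),
    DigitalTwin("v2", [TwinObject("person", {"adult"}), TwinObject("cup", {"container"})]),
    DigitalTwin("v3", [TwinObject("car", {"vehicle"})]),
]

# Implicit query "what is the person using to hold a hot drink?",
# decomposed by hand into sub-queries for this sketch.
sub_queries = ["person", "container"]

# Extra facts that only a (simulated) specialist model can supply.
extra_facts = {("v2", "cup"): {"hot"}}


def specialist(video_id: str, label: str) -> set[str]:
    return extra_facts.get((video_id, label), set())


def condition(obj: TwinObject) -> bool:
    return {"container", "hot"} <= obj.attributes


candidates = retrieve_candidates(sub_queries, twins)
result = reason_and_ground(candidates, condition, specialist)
print(result)  # ('v2', 'cup')
```

In this toy run the attribute "hot" is absent from the initial representation, so the condition only succeeds after the simulated specialist is invoked. That mirrors the role the abstract assigns to just-in-time refinement: extra models are called only when the existing representation leaves an information gap.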

Paper Structure

This paper contains 22 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison between (a) traditional text-to-video retrieval and (b) reasoning text-to-video retrieval. Traditional retrieval accepts explicit queries and returns matching videos without identifying target objects. In contrast, reasoning retrieval interprets indirect queries that demand multiple reasoning steps, while simultaneously localizing the referenced object through segmentation masks (shown in green) within the retrieved video.
  • Figure 2: Overall framework of the proposed reasoning text-to-video retrieval method.
  • Figure 3: Qualitative comparison of reasoning text-to-video retrieval across three implicit queries on ReasonT2VBench-135. Red borders indicate incorrect or failed retrievals where methods return irrelevant videos, while green borders show correct retrieval and grounding. Our method correctly identifies relevant videos and grounds the target objects via segmentation masks (in purple).