Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval

Jiaxin Wu, Xiao-Yong Wei, Qing Li

TL;DR

This work tackles zero-shot text-to-video retrieval for queries that require contextual temporal, logical, or causal reasoning over large video corpora. It introduces an adaptive multi-agent framework with four specialized agents: scalable retrieval ($f_S$), contextual reasoning ($f_R$), query reformulation ($f_Q$), and orchestration ($f_O$). The orchestration agent dynamically schedules the other three across $T$ iterations with an examination window of size $k$, guided by intermediate feedback. A retrieval-performance memory and shared reasoning traces enable coordinated reformulation and interpretable decision-making. On TRECVid AVS benchmarks spanning eight years, the approach doubles the performance of the strong GLSCL baseline and outperforms state-of-the-art methods by a substantial margin, while providing transparent reasoning traces. The result is scalable, zero-shot, temporally aware video retrieval over large-scale corpora.

Abstract

The rise of short-form video platforms and the emergence of multimodal large language models (MLLMs) have amplified the need for scalable, effective, zero-shot text-to-video retrieval systems. While recent advances in large-scale pretraining have improved zero-shot cross-modal alignment, existing methods still struggle with query-dependent temporal reasoning, which limits their effectiveness on complex queries involving temporal, logical, or causal relationships. To address these limitations, we propose an adaptive multi-agent retrieval framework that dynamically orchestrates specialized agents over multiple reasoning iterations based on the demands of each query. The framework includes: (1) a retrieval agent for scalable retrieval over large video corpora, (2) a reasoning agent for zero-shot contextual temporal reasoning, and (3) a query reformulation agent that refines ambiguous queries and recovers performance for queries whose results degrade over iterations. These agents are dynamically coordinated by an orchestration agent, which leverages intermediate feedback and reasoning outcomes to guide execution. We also introduce a novel communication mechanism that incorporates retrieval-performance memory and historical reasoning traces to improve coordination and decision-making. Experiments on three TRECVid benchmarks spanning eight years show that our framework achieves a twofold improvement over CLIP4Clip and outperforms state-of-the-art methods by a large margin.
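
The control flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy corpus, the `State` fields, the action names (`"reason"`, `"reformulate"`, `"stop"`), and every `toy_*` function are hypothetical stand-ins for the retrieval, reasoning, reformulation, and orchestration agents, and captions stand in for video content. `T` and `k` play the roles of the iteration budget and the examination window size.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical toy corpus: video id -> caption standing in for video content.
CORPUS = {
    "v1": "a man opens a door",
    "v2": "a man sits down then opens a door",
    "v3": "a man opens a door then sits down",
    "v4": "a dog runs on a beach",
}

@dataclass
class State:
    query: str
    ranked: List[str] = field(default_factory=list)
    memory: List[float] = field(default_factory=list)  # retrieval-performance memory
    traces: List[str] = field(default_factory=list)    # historical reasoning traces
    actions: List[str] = field(default_factory=list)   # orchestration decisions so far

def toy_retrieve(query: str) -> List[str]:
    """Stand-in retrieval agent: rank by bag-of-words overlap (order-blind)."""
    q = set(query.split())
    return sorted(CORPUS, key=lambda v: (-len(q & set(CORPUS[v].split())), v))

def _in_order(caption: str, events: List[str]) -> bool:
    """True if every event phrase occurs in the caption, in the given order."""
    pos = 0
    for event in events:
        idx = caption.find(event, pos)
        if idx < 0:
            return False
        pos = idx + len(event)
    return True

def toy_reason(query: str, window: List[str]) -> Tuple[List[str], float, str]:
    """Stand-in reasoning agent: promote videos whose events follow the query order."""
    events = query.split(" then ")
    ok = {v: _in_order(CORPUS[v], events) for v in window}
    reranked = sorted(window, key=lambda v: not ok[v])  # stable: matches first
    score = sum(ok.values()) / len(window) if window else 0.0
    trace = f"{sum(ok.values())}/{len(window)} videos in the window match the event order"
    return reranked, score, trace

def toy_reformulate(query: str, traces: List[str]) -> str:
    """Stand-in reformulation agent: normalise the temporal connective."""
    return query.replace(" and then ", " then ")

def toy_orchestrate(state: State) -> str:
    """Stand-in orchestration agent: choose the next action from the feedback."""
    if not state.actions or state.actions[-1] == "reformulate":
        return "reason"
    if state.memory[-1] == 0:  # last reasoning pass found nothing usable
        return "reformulate"
    return "stop"

def run_adaptive_retrieval(query, f_S, f_R, f_Q, f_O, T=3, k=3) -> State:
    state = State(query=query, ranked=f_S(query))
    for _ in range(T):  # at most T orchestration steps
        action = f_O(state)
        state.actions.append(action)
        if action == "reason":
            # Only the top-k examination window is analysed and reranked.
            reranked, score, trace = f_R(state.query, state.ranked[:k])
            state.ranked = reranked + state.ranked[k:]
            state.memory.append(score)
            state.traces.append(trace)
        elif action == "reformulate":
            state.query = f_Q(state.query, state.traces)
            state.ranked = f_S(state.query)
        else:
            break
    return state
```

For example, `run_adaptive_retrieval("a man opens a door then sits down", toy_retrieve, toy_reason, toy_reformulate, toy_orchestrate)` moves `v3` ahead of `v2`, which the order-blind toy retriever ranks first. Keeping the agents as plain callables is a design choice of this sketch, so that the loop itself stays independent of how each agent is realised.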

Paper Structure

This paper contains 22 sections, 5 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Snapshot of our dynamic multi-agent retrieval framework applied to a complex query. The orchestration agent adaptively determines which role agents to invoke at each iteration based on intermediate feedback. Compared to a retrieval-only baseline, the dynamic agent orchestration yields significantly higher retrieval performance.
  • Figure 2: Overview of our multi-agent video retrieval framework. Given a user query, a scalable retrieval agent (S) selects top-ranked candidate videos from a large corpus. A contextual reasoning agent (R) conducts fine-grained intra-video analysis of the retrieved results, while a query reformulation agent (Q) adapts the query in response to ambiguous or low-quality matches. A multi-agent orchestration agent (O) dynamically determines the execution plan, deciding which agents to invoke at each iteration, based on intermediate reasoning signals and past retrieval outcomes.
  • Figure 3: Performance comparison on complex and more challenging queries that require multi-step reasoning.
  • Figure 4: Rank-1 videos retrieved by GLSCL, IITV, and our model, respectively. With contextual temporal reasoning, our model ranks the correct result highest.
  • Figure 5: Performance of the adaptive agentic workflow (with Orchestration Agent) versus fixed greedy strategies, measured by the number of accumulated ground truth videos over the ranked list.
  • ...and 3 more figures
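
The measure named in the Figure 5 caption, the number of accumulated ground truth videos over the ranked list, can be read as a running count of relevant videos at each rank. The snippet below is a small sketch of that reading; the function name and the example ids are hypothetical, and the paper's exact evaluation protocol is not reproduced here.

```python
from itertools import accumulate
from typing import Iterable, List, Set

def accumulated_ground_truth(ranked: Iterable[str], relevant: Set[str]) -> List[int]:
    """Running count of ground-truth videos found at each rank position."""
    return list(accumulate(1 if video in relevant else 0 for video in ranked))

# Hypothetical ranked list and ground-truth set.
curve = accumulated_ground_truth(["v3", "v2", "v1", "v4"], {"v3", "v1"})
```

A curve that rises earlier indicates that ground-truth videos are placed nearer the top of the list.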