Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

Tao Yu, Yujia Yang, Haopeng Jin, Junhao Gong, Xinlong Chen, Yuxuan Zhou, Shanbin Zhang, Jiabing Yang, Xinming Wang, Hongzhu Yi, Ping Nie, Kai Zou, Zhang Zhang, Yan Huang, Liang Wang, Yeshani, Ruiwen Tao, Jin Ma, Haijin Liang, Jinwen Luo

TL;DR

RVMS-Bench tackles the mismatch between idealized benchmarks and real-world video search by modeling fuzzy, multi-dimensional memories and open-web retrieval. The authors introduce RACLO, an abductive-reasoning-based agent that emulates the human Recall-Search-Verify process to locate videos and moments on the open web. The benchmark comprises 1,440 human-verified samples across 20 topics and four duration groups, built with a generation-and-verification pipeline to ensure high-quality ground truth. Experiments show that state-of-the-art MLLMs struggle with open-web retrieval and precise moment localization when given only fragmented memory cues, underscoring the need for robust long-horizon reasoning and agent-based search strategies. Overall, RVMS-Bench provides a scalable platform and an actionable blueprint for advancing video retrieval robustness in real-world, unstructured scenarios.

Abstract

Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present RVMS-Bench, a comprehensive system for evaluating real-world video memory search. It consists of 1,440 samples spanning 20 diverse categories and four duration groups, sourced from real-world open-web videos. RVMS-Bench utilizes a hierarchical description framework encompassing Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose RACLO, an agentic framework that employs abductive reasoning to simulate the human "Recall-Search-Verify" cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.
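
To make the hierarchical description framework concrete, the following is a minimal sketch of how one benchmark sample might be represented. The class name `RVMSSample`, every field name, and the `cue_text` helper are assumptions made for illustration; the abstract does not specify the released data schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RVMSSample:
    """Hypothetical record for one sample; all field names are assumed."""
    video_url: str                          # source video on the open web
    category: str                           # one of the 20 categories
    duration_group: str                     # one of the four duration groups
    target_time_s: float                    # ground-truth moment, in seconds
    global_impression: str                  # G: overall gist of the video
    key_moment: Optional[str] = None        # K: the remembered moment
    temporal_context: Optional[str] = None  # T: what happens around it
    auditory_memory: Optional[str] = None   # A: remembered sound or speech

    def cue_text(self) -> str:
        """Join the cues that are present into one labeled query string."""
        labeled = [
            ("G", self.global_impression),
            ("K", self.key_moment),
            ("T", self.temporal_context),
            ("A", self.auditory_memory),
        ]
        return "\n".join(f"{tag}: {text}" for tag, text in labeled if text)
```

Keeping K, T, and A optional in this sketch reflects the idea that a searcher may remember only some dimensions of a video; whether the benchmark actually permits missing cues is not stated here.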

Paper Structure (46 sections, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Paradigm Shift in Video Retrieval. Comparison between traditional video retrieval benchmarks (left) and our proposed RVMS-Bench (right). While traditional benchmarks typically focus on single-dimension similarity matching within a closed candidate pool of short clips, RVMS-Bench addresses the real-world challenge of retrieving full-length videos from the open internet using multi-dimensional, fragmented memory cues (Global Impression, Key Moment, Temporal Context, and Auditory Memory).
  • Figure 2: Data Construction Pipeline of RVMS-Bench. The pipeline integrates model-assisted generation with rigorous human verification to ensure scalability and quality. Left: We sample 20 topics from YouTube, employing dual-stream sampling for global sparse frames and local anchors. Middle: Gemini 3 Pro generates hierarchical descriptions covering Global Impression (G), Key Moment (K), Temporal Context (T), and Auditory Memory (A). Right: Human experts verify and refine the generated content to eliminate hallucinations, ensuring the semantic uniqueness of the ground truth.
  • Figure 3: Overview of the RACLO Framework. The framework mimics the human "Recall-Search-Verify" cognitive loop. Stage 1 (Query Reasoning & Search): The Agent Brain employs abductive reasoning to translate fragmented multimodal cues into search queries, retrieving candidate URLs from search engines. Stage 2 (Pre-processing): Candidates are downloaded and processed into audio-visual streams. Stage 3 (Parallel Verification & Localization): A dual-granularity mechanism validates video content against the Global Impression (G) while simultaneously localizing the target frame using Key Moment (K), Temporal Context (T), and Auditory Memory (A).
  • Figure 4: Ablation studies for Video Retrieval (VR) and Moment Localization (ML) tasks. The panels illustrate performance sensitivity across three key technical dimensions: (1) Query Complexity, representing the number of expanded search keywords generated for each query; (2) Search Scale, indicating the number of candidate URLs returned per retrieval execution; (3) Visual Granularity, referring to the total number of frames sampled and extracted from a single video for processing.
  • Figure 5: Data Distributions of RVMS-Bench. Breakdowns of video duration, category, and task types.
  • ...and 4 more figures
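
The three stages described in the Figure 3 caption can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `recall_search_verify`, the `MemoryCues` and `Result` types, and all five injected callables are hypothetical names, and the sketch runs verification and localization one after the other, whereas the caption describes them as running in parallel.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class MemoryCues:
    """Fragmented memory cues (G, K, T, A); field names are assumed."""
    global_impression: str
    key_moment: str
    temporal_context: str
    auditory_memory: str


@dataclass
class Result:
    url: str
    timestamp_s: float


def recall_search_verify(
    cues: MemoryCues,
    propose_queries: Callable[[MemoryCues], List[str]],    # Stage 1: reasoning
    search: Callable[[str], List[str]],                    # Stage 1: search
    preprocess: Callable[[str], Optional[Any]],            # Stage 2
    verify_global: Callable[[Any, str], bool],             # Stage 3: check G
    localize: Callable[[Any, MemoryCues], Optional[float]],  # Stage 3: K, T, A
    max_candidates: int = 5,
) -> Optional[Result]:
    """Return the first candidate that passes verification and localization."""
    seen = set()
    for query in propose_queries(cues):
        for url in search(query)[:max_candidates]:
            if url in seen:
                continue  # do not re-process a URL returned by several queries
            seen.add(url)
            media = preprocess(url)
            if media is None:
                continue  # download or decoding failed
            if not verify_global(media, cues.global_impression):
                continue  # content does not match the Global Impression
            timestamp = localize(media, cues)
            if timestamp is not None:
                return Result(url, timestamp)
    return None
```

Passing the stages in as callables keeps the sketch independent of any particular search engine, downloader, or MLLM. In this sketch `max_candidates` caps the URLs considered per query, which loosely corresponds to the Search Scale dimension in the Figure 4 ablation.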