Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization
Tao Yu, Yujia Yang, Haopeng Jin, Junhao Gong, Xinlong Chen, Yuxuan Zhou, Shanbin Zhang, Jiabing Yang, Xinming Wang, Hongzhu Yi, Ping Nie, Kai Zou, Zhang Zhang, Yan Huang, Liang Wang, Yeshani, Ruiwen Tao, Jin Ma, Haijin Liang, Jinwen Luo
TL;DR
RVMS-Bench tackles the mismatch between idealized benchmarks and real-world video search by modeling fuzzy, multi-dimensional memories and open-web retrieval. The authors introduce RACLO, an abductive-reasoning-based agent that emulates the human Recall-Search-Verify process to locate videos and moments on the open web. The benchmark comprises 1,440 human-verified samples across 20 categories and four duration groups, built with a generation-and-verification pipeline intended to ensure high-quality ground truth. Experiments show that state-of-the-art MLLMs struggle with open-web retrieval and precise moment localization when given only memory-fragment cues, underscoring the need for robust long-horizon reasoning and agent-based search strategies. Overall, RVMS-Bench provides a benchmark and an agent blueprint for advancing video retrieval robustness in real-world, unstructured scenarios.
Abstract
Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present \textbf{RVMS-Bench}, a comprehensive system for evaluating real-world video memory search. It consists of \textbf{1,440 samples} spanning \textbf{20 diverse categories} and \textbf{four duration groups}, sourced from \textbf{real-world open-web videos}. RVMS-Bench utilizes a hierarchical description framework encompassing \textbf{Global Impression, Key Moment, Temporal Context, and Auditory Memory} to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose \textbf{RACLO}, an agentic framework that employs abductive reasoning to simulate the human ``Recall-Search-Verify'' cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.
