LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos

Rongyi Yu; Chenyuan Duan; Wentao Zhang

Abstract

Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from failures in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories (State Mutation, Causal Inference, Global Summary, and Visual Tracking) with 2-hop, 3-hop, and 4-hop evidence requirements. To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which fixes the retrieval backend and isolates the agent's ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy-efficiency trade-off under identical access conditions. We evaluate VideoAgent-style QA agents with multiple backbone LLMs, scoring answers by three-judge majority voting. GPT-5 achieves the highest accuracy (42.43), outperforming Gemini 3 Pro (30.97) and GPT-4o (19.20), yet it remains below 50%, highlighting the difficulty of multi-hop retrieval planning. When agents are given the gold evidence clips, performance becomes near-perfect, confirming retrieval planning as the primary bottleneck.
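The retrieval-necessity condition and the three-judge majority vote described in the abstract can be illustrated with a minimal sketch. This is not the benchmark's implementation: the helper names is_strictly_k_hop and majority_vote, the can_answer oracle, and the toy clip identifiers are all hypothetical, introduced only to make the two definitions concrete.

```python
from typing import Callable, Sequence


def is_strictly_k_hop(
    clips: Sequence[str],
    can_answer: Callable[[Sequence[str]], bool],
) -> bool:
    """Leave-one-out necessity check (hypothetical helper).

    True only if the full clip set answers the question and
    removing any single clip makes it unanswerable.
    """
    if not can_answer(clips):
        return False
    for i in range(len(clips)):
        reduced = [c for j, c in enumerate(clips) if j != i]
        if can_answer(reduced):
            return False  # clip i was not necessary
    return True


def majority_vote(verdicts: Sequence[bool]) -> bool:
    """True when more than half of the judges accept the answer."""
    return 2 * sum(verdicts) > len(verdicts)


# Toy oracle (an assumption): the question is answerable only
# when both required clips are present in the evidence set.
REQUIRED = {"clip_a", "clip_b"}


def toy_can_answer(clips: Sequence[str]) -> bool:
    return REQUIRED <= set(clips)


if __name__ == "__main__":
    print(is_strictly_k_hop(["clip_a", "clip_b"], toy_can_answer))
    print(is_strictly_k_hop(["clip_a", "clip_b", "clip_c"], toy_can_answer))
    print(majority_vote([True, True, False]))
```

In this sketch a clip set containing a redundant clip fails the check, which mirrors the requirement that a Hop-k question has exactly k necessary clips. In practice the oracle would be a model or human verifier rather than a set-membership test.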

Paper Structure (32 sections, 3 figures, 7 tables)

Introduction
Related Work
Long-Video Understanding Benchmarks
Multi-Hop Retrieval and Necessity Verification
Tool-Augmented Video Agents
Data Synthesis
The LongVidSearch Dataset Construction
Data Source: The LoVR Dataset
Taxonomy of Multi-Hop Retrieval Tasks
The Agentic Construction Pipeline
LongVidSearch Statistics
Tools
Experiments
Experimental Settings
Evaluation Metrics
...and 17 more sections

Figures (3)

Figure 1: Overview of LongVidSearch. We illustrate the end-to-end pipeline with a representative 2-hop example: the agent iteratively retrieves candidate clips, accesses evidence via standardized tools, and produces a final answer that is scored by a three-judge protocol with majority vote.
Figure 2: Two types of failure analysis.
Figure 3: Data examples at different hop levels. Each block lists the question, answer, golden clips, reasoning chain, category, hop level, video ID, and verification fields.
