LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts

Qifeng Cai, Hao Liang, Hejun Dong, Meiyi Qiang, Ruichuan An, Zhaoyang Han, Zhengzhou Zhu, Bin Cui, Wentao Zhang

TL;DR

LoVR addresses the need for robust evaluation of long-form video-text retrieval by introducing a large-scale, high-quality dataset of 467 long videos and 40,804 fine-grained clips with captions generated through a Vision-Language Model pipeline and validated by EVQAScore plus human checks. A semantic fusion approach constructs coherent full-video captions from clip-level captions, enabling both full-video and clip-level retrieval evaluation. Extensive experiments show existing baselines struggle on LoVR, revealing gaps in long-temporal modeling and cross-modal alignment, and demonstrating the benchmark’s value for driving advances in multimodal video understanding. The authors provide a scalable caption generation framework and release the dataset and code to encourage future research in long-form video retrieval and reasoning.

Abstract

Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates automatic VLM generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. The code and a link to the dataset are available at https://github.com/TechNomad-ds/LoVR-benchmark
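
The generate, score, and refine loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: `refine_caption`, `CaptionResult`, `make_stubs`, the retry budget `max_rounds`, and the exact acceptance rule are all hypothetical, and the captioner and scorer are stand-in callables rather than a real VLM or EVQAScore. The default threshold of 0.2 is borrowed from the Figure 3 caption, which reports fewer annotation errors once the match score exceeds 0.2.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CaptionResult:
    caption: Optional[str]   # best caption seen so far
    score: float             # quality score of that caption
    rounds: int              # number of generation rounds used
    needs_human: bool        # True when no draft passed the threshold


def refine_caption(
    clip_id: str,
    generate: Callable[[str, Optional[str]], str],  # hypothetical captioner
    score: Callable[[str, str], float],             # hypothetical quality scorer
    threshold: float = 0.2,   # assumption: taken from the Figure 3 caption
    max_rounds: int = 3,      # assumption: retry budget is not given in the source
) -> CaptionResult:
    """Regenerate a caption until its score exceeds the threshold,
    otherwise flag the clip for human annotation."""
    best_caption, best_score = None, float("-inf")
    for round_idx in range(1, max_rounds + 1):
        caption = generate(clip_id, best_caption)
        current = score(clip_id, caption)
        if current > best_score:
            best_caption, best_score = caption, current
        if current > threshold:
            return CaptionResult(caption, current, round_idx, needs_human=False)
    return CaptionResult(best_caption, best_score, max_rounds, needs_human=True)


def make_stubs(score_table):
    """Build deterministic stand-ins for the captioner and the scorer."""
    calls = {}

    def generate(clip_id, previous):
        calls[clip_id] = calls.get(clip_id, 0) + 1
        return f"{clip_id} caption v{calls[clip_id]}"

    def score(clip_id, caption):
        return score_table[clip_id][calls[clip_id] - 1]

    return generate, score


generate, score = make_stubs({
    "clip_a": [0.05, 0.31],        # passes on the second round
    "clip_b": [0.02, 0.10, 0.08],  # never passes, goes to a human
})
accepted = refine_caption("clip_a", generate, score)
flagged = refine_caption("clip_b", generate, score)
```

In this sketch a clip that never clears the threshold keeps its best-scoring draft, so a human annotator would start from that draft rather than from scratch. That is a design choice of the sketch, not something the abstract states.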

Paper Structure

This paper contains 38 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview of the data construction pipeline in LoVR. Step 1 segments long videos into high-dynamic clips using visual change detection and threshold-based filtering. Step 2 generates high-quality clip-level captions via iterative VLM captioning and human fallback based on EVQAScore. Step 3 constructs long-video captions by clustering clip captions and summarizing them into a full-length description. Finally, a human review process is conducted to ensure quality.
  • Figure 2: Illustration of Video Dynamics: High vs. Low Scene Dynamics. Example video frames are shown in the upper part (high-dynamic on the left, low-dynamic on the right), while the lower part presents the computed inter-frame difference curves. The high-dynamic video exhibits large frame-to-frame variations, whereas the low-dynamic video remains visually stable.
  • Figure 3: The figure shows four panels from left to right: (a) Distribution of match scores and annotation errors on the sampled subset $K'$. The left y-axis shows the match score distribution, while the right y-axis displays the median content and theme annotation errors. A notable reduction in annotation errors occurs when the match score exceeds 0.2. (b) Distribution of caption lengths for long videos, with most captions concentrated around 10,000 tokens. (c) Distribution of clip durations. (d) Distribution of caption lengths for video clips.
  • Figure 4: Recall@K performance comparison across baseline models on LoVR clip-level retrieval.
  • Figure 5: Two illustrative retrieval cases are provided, where the key retrieval targets are highlighted in red.
  • ...and 3 more figures
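
Figure 4 reports Recall@K, the fraction of queries whose ground-truth item appears among the top K retrieved candidates. The sketch below shows the standard computation from a query-by-candidate similarity matrix. The function name `recall_at_k`, the toy scores, and the tie-handling rule are illustrative assumptions and are not taken from the LoVR code release.

```python
from typing import Dict, Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],  # similarity[q][c]: query q vs. candidate c
    gt_index: Sequence[int],                # index of the correct candidate per query
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Return Recall@K for each K: the share of queries whose ground-truth
    candidate is ranked within the top K by similarity."""
    n = len(gt_index)
    if n == 0:
        raise ValueError("at least one query is required")
    hits = {k: 0 for k in ks}
    for row, gt in zip(similarity, gt_index):
        gt_score = row[gt]
        # 0-based rank: number of candidates scoring strictly higher.
        # Assumption: ties are resolved in favour of the ground truth.
        rank = sum(1 for s in row if s > gt_score)
        for k in ks:
            if rank < k:
                hits[k] += 1
    return {k: hits[k] / n for k in ks}


# Toy example: 3 text queries against 4 candidate clips (made-up scores).
toy_similarity = [
    [0.9, 0.1, 0.2, 0.3],  # ground truth (index 0) ranked first
    [0.2, 0.3, 0.8, 0.1],  # ground truth (index 1) ranked second
    [0.5, 0.4, 0.3, 0.6],  # ground truth (index 2) ranked last
]
toy_recall = recall_at_k(toy_similarity, gt_index=[0, 1, 2], ks=(1, 2, 4))
```

On this toy input one of three queries is a hit at K=1, two of three at K=2, and all three at K=4.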