LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts

Qifeng Cai; Hao Liang; Hejun Dong; Meiyi Qiang; Ruichuan An; Zhaoyang Han; Zhengzhou Zhu; Bin Cui; Wentao Zhang

TL;DR

LoVR addresses the need for robust evaluation of long-form video-text retrieval by introducing a large-scale, high-quality dataset of 467 long videos and 40,804 fine-grained clips with captions generated through a Vision-Language Model pipeline and validated by EVQAScore plus human checks. A semantic fusion approach constructs coherent full-video captions from clip-level captions, enabling both full-video and clip-level retrieval evaluation. Extensive experiments show existing baselines struggle on LoVR, revealing gaps in long-temporal modeling and cross-modal alignment, and demonstrating the benchmark’s value for driving advances in multimodal video understanding. The authors provide a scalable caption generation framework and release the dataset and code to encourage future research in long-form video retrieval and reasoning.

Abstract

Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates automatic VLM generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. With longer videos, more detailed captions, and a larger scale than existing benchmarks, LoVR presents new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at https://github.com/TechNomad-ds/LoVR-benchmark
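The generate, score, and refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `generate` and `score` callables, the `threshold` value, the `max_rounds` budget, and the feedback string are all hypothetical stand-ins for a VLM captioner and a caption quality scorer, chosen only to show the control flow.

```python
from typing import Callable, Tuple


def caption_with_refinement(
    clip_id: str,
    generate: Callable[[str, str], str],  # hypothetical: (clip_id, feedback) -> caption
    score: Callable[[str, str], float],   # hypothetical: (clip_id, caption) -> quality score
    threshold: float = 0.8,               # assumed acceptance threshold
    max_rounds: int = 3,                  # assumed refinement budget
) -> Tuple[str, float, bool]:
    """Return (best_caption, best_score, accepted) for one clip."""
    best_caption, best_score = "", float("-inf")
    feedback = ""  # the first round has no feedback
    for _ in range(max_rounds):
        caption = generate(clip_id, feedback)
        quality = score(clip_id, caption)
        # Keep the best caption seen so far, even if it never passes.
        if quality > best_score:
            best_caption, best_score = caption, quality
        if quality >= threshold:
            return best_caption, best_score, True
        # Illustrative feedback only; a real prompt would be model-specific.
        feedback = f"Previous caption scored {quality:.2f}; add missing visual detail."
    return best_caption, best_score, False
```

In this sketch, a clip whose caption never reaches the threshold is returned with `accepted=False`; one plausible use of that flag is routing the clip to a manual check.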
