CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval
Yifan Xu, Xinhao Li, Yichun Yang, Desen Meng, Rui Huang, Limin Wang
TL;DR
CaReBench addresses the need for fine-grained evaluation in video captioning and retrieval by providing 1,000 videos with detailed, hierarchically structured captions and manually separated spatial and temporal annotations. It introduces two metrics, ReBias for retrieval and CapST for captioning, to quantify spatiotemporal biases, and it shows that retrieval and captioning can be unified under a single mapping from pixel space to a high-dimensional embedding, expressed as $\phi: \mathbb{R}^{T \times H \times W \times C} \rightarrow \mathbb{R}^D$. The CaRe baseline, built on Qwen2-VL with two-stage supervised fine-tuning, achieves competitive results against CLIP-based models and various MLLMs, illustrating the practicality of a unified video-language framework for fine-grained tasks. The work contributes a benchmark design, bias-aware evaluation metrics, and a unified modeling baseline for video-language understanding.
Abstract
Video understanding, including video captioning and retrieval, remains a major challenge for video-language models (VLMs). Existing video retrieval and captioning benchmarks include only short descriptions, which limits their ability to evaluate detailed video understanding. To address this problem, we present CaReBench, a testing benchmark for fine-grained video captioning and retrieval with 1,000 high-quality pairs of videos and human-annotated detailed captions. Uniquely, it provides manually separated spatial and temporal annotations for each video. Based on this design, we introduce two evaluation metrics, ReBias and CapST, tailored to video retrieval and video captioning, respectively. These metrics enable a comprehensive investigation into the spatial and temporal biases inherent in VLMs. In addition, to handle both video retrieval and video captioning in a unified framework, we develop a simple baseline based on a Multimodal Large Language Model (MLLM). By applying two-stage Supervised Fine-Tuning (SFT), we unlock the potential of the MLLM, enabling it not only to generate detailed video descriptions but also to extract video features. Surprisingly, experimental results demonstrate that, compared to CLIP-based models designed for retrieval and popular MLLMs skilled in video captioning, our baseline shows competitive performance in both fine-grained video retrieval and detailed video captioning.
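To make the feature-extraction side of the unified framework concrete, the following is a minimal sketch of how embedding-based text-to-video retrieval is commonly scored, assuming each video and each caption has already been mapped to a D-dimensional vector. It is not the CaRe implementation and does not compute ReBias or CapST; the function names `l2_normalize` and `recall_at_k` are hypothetical, and the embeddings are random stand-ins for real model outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def recall_at_k(text_emb, video_emb, k=1):
    """Text-to-video Recall@K.

    Assumption: text_emb[i] and video_emb[i] form the ground-truth pair,
    and both arrays have shape (N, D).
    """
    sims = l2_normalize(text_emb) @ l2_normalize(video_emb).T  # (N, N)
    # Indices of the k most similar videos for each text query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(sims.shape[0])[:, None]
    # A query is a hit if its paired video appears among its top-k results.
    return float((topk == targets).any(axis=1).mean())


# Stand-in embeddings: captions are noisy copies of their paired videos.
rng = np.random.default_rng(0)
video_emb = rng.normal(size=(8, 16))
text_emb = video_emb + 0.01 * rng.normal(size=(8, 16))
print("Recall@1:", recall_at_k(text_emb, video_emb, k=1))
```

In this setup, swapping the full caption embeddings for embeddings of only the spatial or only the temporal annotation would give two separate retrieval scores per model, which is the kind of comparison the separated annotations make possible.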
