CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval

Yifan Xu, Xinhao Li, Yichun Yang, Desen Meng, Rui Huang, Limin Wang

TL;DR

CaReBench addresses the need for fine-grained evaluation in video captioning and retrieval by providing 1,000 videos with detailed, hierarchically structured captions in which spatial and temporal descriptions are explicitly separated. It introduces two metrics, ReBias for retrieval and CapST for captioning, to quantify spatiotemporal biases, and demonstrates that retrieval and captioning can be unified under a single mapping from the pixel space to a high-dimensional embedding, expressed as $\phi: \mathbb{R}^{T \times H \times W \times C} \rightarrow \mathbb{R}^D$. The CaRe baseline, built on Qwen2-VL with two-stage supervised fine-tuning, achieves competitive results against CLIP-based models and various MLLMs, illustrating the practicality of a unified video-language framework for fine-grained tasks. This work advances benchmark design, bias-aware evaluation, and unified modeling for future improvements in video-language understanding.
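To make the role of the embedding space $\mathbb{R}^D$ concrete, the following is a minimal sketch of how text-to-video retrieval could work once videos and captions are mapped to vectors. It is not the authors' implementation: the function names, the toy 3-dimensional embeddings, and the use of cosine similarity as the scoring function are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_videos(text_embedding, video_embeddings):
    """Return video indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(text_embedding, v) for v in video_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical embeddings with D = 3; a real model would produce these
# from video frames and caption text.
video_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_embedding = [0.1, 0.9, 0.0]

print(rank_videos(query_embedding, video_embeddings))  # [1, 2, 0]
```

In this toy setup the query points mostly along the second axis, so the second video ranks first, followed by the video that mixes both axes.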

Abstract

Video understanding, including video captioning and retrieval, is still a great challenge for video-language models (VLMs). Existing video retrieval and captioning benchmarks include only short descriptions, which limits their ability to evaluate detailed video understanding. To address this problem, we present CaReBench, a testing benchmark for fine-grained video captioning and retrieval with 1,000 high-quality pairs of videos and human-annotated detailed captions. Uniquely, it provides manually separated spatial annotations and temporal annotations for each video. Based on this design, we introduce two evaluation metrics, ReBias and CapST, specifically tailored for video retrieval and video captioning tasks, respectively. These metrics enable a comprehensive investigation into the spatial and temporal biases inherent in VLMs. In addition, to handle both video retrieval and video captioning tasks in a unified framework, we develop a simple baseline based on a Multimodal Language Model (MLLM). By implementing two-stage Supervised Fine-Tuning (SFT), we fully unlock the potential of the MLLM, enabling it not only to generate detailed video descriptions but also to extract video features. Surprisingly, experimental results demonstrate that, compared to CLIP-based models designed for retrieval and popular MLLMs skilled in video captioning, our baseline shows competitive performance in both fine-grained video retrieval and detailed video captioning.

Paper Structure (28 sections, 4 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Comparison of captions from MSR-VTT, GPT-4o-generated data (ShareGPT-4o), and CaReBench. The caption in the upper left corner is from MSR-VTT and contains only a short, coarse description. The annotation in the lower left corner is generated by GPT-4o and sourced from ShareGPT-4o; it contains some coarse-grained, uncertain, and wrong descriptions. The fine-grained caption on the right is selected from CaReBench and was created by our human annotators following the pipeline. The green sentences are fine-grained descriptions, and the brown words mark the temporal sequence in the video.
  • Figure 2: Comparison of CaReBench performance across CLIP-based retrieval models, MLLM captioning models, and our unified model. Results for MLLMs are reported on their public versions without contrastive training. CLIP-based retrieval models achieve excellent performance on video retrieval but lack the ability to describe videos. MLLMs, on the other hand, can describe videos in detail, but their retrieval performance is very poor. In contrast, CaRe, the unified model we propose, delivers both outstanding retrieval performance and a strong capability to describe videos. Features are extracted from MLLMs using the EOL prompt of E5-V.
  • Figure 3: Statistics of CaReBench. Most videos range from 5-20 seconds and most captions fall between 150 and 300 words in length.
  • Figure 4: An overview of the annotation pipeline. In Stage-I, workers are asked to describe videos hierarchically and in detail. In Stage-II, workers separate spatial descriptions from temporal descriptions.
  • Figure 5: The training recipe of CaRe. In the first stage, we align CaRe outputs to a fine-grained text space, enabling it to describe videos in detail. In the second stage, a contrastive learning method is applied to get features from the inputs. The output space of CaRe shifts from the vocabulary space $\mathbb{R}^{D_v}$ in Stage-I to the embedding space $\mathbb{R}^{D_e}$ in Stage-II.
  • ...and 5 more figures
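
The Figure 5 caption says Stage-II applies "a contrastive learning method" but does not name the loss. As an illustration only, the sketch below implements a symmetric InfoNCE loss, a common choice for aligning paired video and text embeddings; whether CaRe uses this exact form is an assumption, as are the function names and the `temperature` value.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_info_nce(video_embs, text_embs, temperature=0.05):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Assumes the i-th video matches the i-th text and that all embeddings
    are already L2-normalized, so the dot product is a cosine similarity.
    The temperature is a hypothetical default, not a value from the paper.
    """
    n = len(video_embs)
    # Similarity matrix: rows index videos, columns index texts.
    sim = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in text_embs]
        for v in video_embs
    ]
    # Transpose so that rows index texts and columns index videos.
    sim_t = [list(col) for col in zip(*sim)]
    # Cross-entropy with the matching pair (the diagonal) as the target.
    video_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_video = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (video_to_text + text_to_video)


# Toy batch with D = 2: correctly paired embeddings give a near-zero loss,
# while swapping the texts makes every pair a mismatch and raises the loss.
videos = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

print(symmetric_info_nce(videos, texts_aligned))
print(symmetric_info_nce(videos, texts_swapped))
```

Minimizing a loss of this kind pulls matching video and text embeddings together and pushes mismatched pairs apart, which is what lets the embedding space $\mathbb{R}^{D_e}$ support retrieval.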