VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation
Shi-Xue Zhang, Hongfa Wang, Duojun Huang, Xin Li, Xiaobin Zhu, Xu-Cheng Yin
TL;DR
VCapsBench tackles the lack of fine-grained video caption evaluation by introducing a large-scale benchmark with 5,677 videos and 109,796 QA pairs across 21 dimensions, enabling spatio-temporal assessment for text-to-video generation. It employs dual QA-generation pipelines and an LLM-based TextQA framework to compute the $AR$, $IR$, and $CR$ metrics, providing automated, fine-grained evaluation of caption quality. Experiments across ten VLMs show Gemini-2.5-Pro-Preview achieving the top $AR$ and $CR$ with a low $IR$, underscoring the need for temporally aware, comprehensive captions. The dataset and evaluation protocol offer a practical, scalable tool for benchmarking video-language models and guiding caption optimization in video generation tasks.
Abstract
Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing the spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark, comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically shown to be critical for text-to-video generation. We further introduce three metrics, Accuracy (AR), Inconsistency Rate (IR), and Coverage Rate (CR), and an automated evaluation pipeline that leverages a large language model (LLM) to verify caption quality via contrastive QA-pair analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and code are available at https://github.com/GXYM/VCapsBench.
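The abstract names the three metrics without giving formulas. As a rough illustration only, the sketch below assumes that an LLM judge labels each QA pair as "correct", "incorrect", or "unanswerable" given the caption, and that AR, IR, and CR are the fractions of all questions answered correctly, answered incorrectly, and answered at all. The label names, the function `caption_metrics`, and the choice of denominator are assumptions made for this example, not the benchmark's confirmed definitions.

```python
from collections import Counter


def caption_metrics(judgements):
    """Compute illustrative AR, IR, and CR scores for one caption.

    `judgements` holds one hypothetical label per QA pair:
    "correct", "incorrect", or "unanswerable". All three rates use the
    total number of questions as the denominator (an assumption).
    """
    total = len(judgements)
    if total == 0:
        raise ValueError("at least one judgement is required")

    counts = Counter(judgements)
    correct = counts["correct"]
    incorrect = counts["incorrect"]

    return {
        "AR": correct / total,                # answered and matching the reference
        "IR": incorrect / total,              # answered but contradicting the reference
        "CR": (correct + incorrect) / total,  # answerable from the caption at all
    }


if __name__ == "__main__":
    labels = ["correct", "correct", "incorrect", "unanswerable"]
    print(caption_metrics(labels))
```

Under these assumptions CR equals AR plus IR, so a caption that omits details lowers CR, while a caption that states wrong details raises IR.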
