VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation

Shi-Xue Zhang, Hongfa Wang, Duojun Huang, Xin Li, Xiaobin Zhu, Xu-Cheng Yin

TL;DR

VCapsBench tackles the lack of fine-grained video caption evaluation by introducing a large-scale benchmark with 5,677 videos and 109,796 QA pairs across 21 dimensions, enabling spatio-temporal assessment for text-to-video generation. It employs dual QA-generation pipelines and an LLM-based TextQA framework to compute $AR$, $IR$, and $CR$ metrics, providing automated, fine-grained evaluation of caption quality. Experiments across ten VLMs show Gemini-2.5-Pro-Preview achieving top $AR$ and $CR$ with low $IR$, underscoring the need for temporally-aware, comprehensive captions. The dataset and evaluation protocol offer a practical, scalable tool for benchmarking video-language models and guiding caption optimization in video generation tasks.

Abstract

Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing the spatio-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark, comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics, Accuracy (AR), Inconsistency Rate (IR), and Coverage Rate (CR), and an automated evaluation pipeline that leverages a large language model (LLM) to verify caption quality via contrastive QA-pair analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and code are available at https://github.com/GXYM/VCapsBench.
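As a rough illustration of how three such rates could be tallied from yes-no QA pairs, the sketch below makes several assumptions that are not stated in the abstract: that an LLM answers each question from the caption alone with "yes", "no", or "n/a" (the caption does not address the question), and that AR, IR, and CR are each normalized by the total number of QA pairs. The names `QAResult` and `caption_metrics` are hypothetical, and the paper's exact formulas may differ.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class QAResult:
    """One yes-no question evaluated against a caption (hypothetical structure)."""
    ground_truth: str  # "yes" or "no", the annotated answer for the video
    predicted: str     # "yes", "no", or "n/a" when the caption does not cover the question


def caption_metrics(results: Sequence[QAResult]) -> Dict[str, float]:
    """Tally assumed AR / IR / CR rates over all QA pairs for one caption set.

    Assumed definitions (not taken from the paper):
      AR = answered and matching the ground truth / total
      IR = answered but contradicting the ground truth / total
      CR = answered at all (not "n/a") / total
    """
    total = len(results)
    if total == 0:
        raise ValueError("at least one QA result is required")

    answered = [r for r in results if r.predicted != "n/a"]
    correct = sum(1 for r in answered if r.predicted == r.ground_truth)
    inconsistent = len(answered) - correct

    return {
        "AR": correct / total,
        "IR": inconsistent / total,
        "CR": len(answered) / total,
    }


# Toy example: two correct answers, one contradiction, one uncovered question.
example = [
    QAResult("yes", "yes"),
    QAResult("no", "no"),
    QAResult("yes", "no"),
    QAResult("no", "n/a"),
]
metrics = caption_metrics(example)
```

Under these assumed definitions, AR and IR always sum to CR, so a caption can only raise its accuracy by covering more of the questioned details without contradicting the video.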

Paper Structure

This paper contains 15 sections, 3 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Illustration of video caption evaluation by VCapsBench, which assesses the detail, comprehensiveness, and accuracy of video captions using "yes-no" question-answer pairs.
  • Figure 2: An example of video caption and question-answer pairs in our VCapsBench.
  • Figure 3: (a) Video source distribution; (b) Video duration distribution; (c) Video resolution distribution; (d) Video aspect ratio distribution.
  • Figure 4: The QA-pair generation pipeline, which includes multiple data processing pipelines and a data correction pipeline.
  • Figure 5: The question length and category distribution.
  • ...and 6 more figures