VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation

Xinlong Chen, Yuanxing Zhang, Chongling Rao, Yushuo Guan, Jiaheng Liu, Fuzheng Zhang, Chengru Song, Qiang Liu, Di Zhang, Tieniu Tan

TL;DR

VidCapBench addresses the gap between video caption evaluation and controllable text-to-video (T2V) generation by introducing a format-agnostic, multi-dimensional evaluation framework. Its data pipeline combines expert model labeling with human refinement and yields QA pairs across four dimensions (Video Aesthetics, Video Content, Video Motion, and Physical Laws), which are split into an automatically evaluated subset (VidCapBench-AE) and a human-evaluated subset (VidCapBench-HE). Empirical results show that VidCapBench is more stable than existing benchmarks and that its scores strongly correlate with downstream T2V quality, as confirmed by training-free verification across multiple T2V models. The benchmark can therefore guide caption improvements that enhance T2V generation in real-world applications.

Abstract

The training of controllable text-to-video (T2V) models relies heavily on the alignment between videos and captions, yet little existing research connects video caption evaluation with T2V generation assessment. This paper introduces VidCapBench, a video caption evaluation scheme specifically designed for T2V generation, agnostic to any particular caption format. VidCapBench employs a data annotation pipeline, combining expert model labeling and human refinement, to associate each collected video with key information spanning video aesthetics, content, motion, and physical laws. VidCapBench then partitions these key information attributes into automatically assessable and manually assessable subsets, catering to both the rapid evaluation needs of agile development and the accuracy requirements of thorough validation. By evaluating numerous state-of-the-art captioning models, we demonstrate the superior stability and comprehensiveness of VidCapBench compared to existing video captioning evaluation approaches. Verification with off-the-shelf T2V models reveals a significant positive correlation between scores on VidCapBench and the T2V quality evaluation metrics, indicating that VidCapBench can provide valuable guidance for training T2V models. The project is available at https://github.com/VidCapBench/VidCapBench.
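To make the evaluation scheme described in the abstract more concrete, the following is a minimal sketch of how judged QA pairs might be aggregated into accuracy scores per subset and per dimension. The record layout (`JudgedQA`), the field names, the dimension labels, and the sample records are all assumptions made for illustration; they are not taken from the VidCapBench code base, and the paper's actual metrics may be defined differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedQA:
    """One QA pair after a judge has checked it against a caption (hypothetical layout)."""
    dimension: str  # e.g. "aesthetics", "content", "motion", "physical_laws"
    subset: str     # "AE" (automatically assessable) or "HE" (manually assessable)
    correct: bool   # True if the caption supports the reference answer


def accuracy_by_group(records, key):
    """Return {group: fraction of correct records} for groups defined by key(record)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        hits[group] += int(record.correct)
    return {group: hits[group] / totals[group] for group in totals}


# Made-up judgments for a single caption, purely for illustration.
records = [
    JudgedQA("content", "AE", True),
    JudgedQA("content", "AE", False),
    JudgedQA("motion", "AE", True),
    JudgedQA("physical_laws", "HE", False),
    JudgedQA("aesthetics", "HE", True),
]

by_subset = accuracy_by_group(records, lambda r: r.subset)
by_dimension = accuracy_by_group(records, lambda r: r.dimension)
print(by_subset)
print(by_dimension)
```

Grouping by an arbitrary key keeps the same helper usable for the fast automated subset and for the slower human-checked subset, which mirrors the split between rapid evaluation and thorough validation described above.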

Paper Structure

This paper contains 32 sections, 19 figures, and 11 tables.

Figures (19)

  • Figure 1: VidCapBench evaluates the video captioning model from the aspects of T2V generation.
  • Figure 2: Illustration of the data curation pipeline and the distribution of QA pairs in VidCapBench. The QA pairs are carefully rectified to ensure that they primarily assess the quality of video captions rather than the inherent capabilities of the judge model.
  • Figure 3: An example of the QA pairs for a video in VidCapBench.
  • Figure 4: Illustration of the training-free T2V verification for video caption evaluation. "VA", "SC", "AR", and "LC" denote the four key dimensions of T2V quality evaluation: "Visual Aesthetics", "Subject Consistency", "Action Relevance", and "Logical Coherence", respectively. In this case, the video is associated with nine QA pairs in VidCapBench-AE and four QA pairs in VidCapBench-HE. The similarity between the generated video and the original video, as well as the overall generation quality, are strongly correlated with the evaluation results in VidCapBench. Among the captioning models compared, Gemini exhibits the best performance.
  • Figure 5: Correlations between automated T2V quality evaluations and VidCapBench-AE Acc (upper) and VidCapBench full set Acc (lower). The Pearson correlation coefficient is denoted by "r".
  • ...and 14 more figures
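
The Figure 5 caption reports a Pearson correlation coefficient "r" between VidCapBench accuracy and automated T2V quality scores. As a reminder of what that statistic measures, here is a minimal sketch of the standard Pearson formula. The function name `pearson_r` and the two input lists are hypothetical placeholders, not values or code from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder numbers, NOT results from the paper: one entry per captioning model.
caption_acc = [0.10, 0.12, 0.15, 0.18]  # hypothetical benchmark accuracies
t2v_quality = [2.0, 2.5, 2.9, 3.6]      # hypothetical T2V quality scores

r = pearson_r(caption_acc, t2v_quality)
print(f"r = {r:.3f}")
```

A value of r close to 1 indicates that captioning models with higher benchmark accuracy also tend to yield higher T2V quality scores; it describes a linear association and does not by itself establish causation.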