VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation

Xinlong Chen; Yuanxing Zhang; Chongling Rao; Yushuo Guan; Jiaheng Liu; Fuzheng Zhang; Chengru Song; Qiang Liu; Di Zhang; Tieniu Tan

TL;DR

VidCapBench addresses the gap between video caption evaluation and controllable text-to-video (T2V) generation with a format-agnostic, multi-dimensional evaluation framework. Its annotation pipeline combines expert model labeling with human refinement to build QA pairs across four dimensions (Video Aesthetics, Video Content, Video Motion, and Physical Laws), and these are split into an automatically evaluated subset (AE) and a human-evaluated subset (HE). Empirical results show that VidCapBench is more stable than existing benchmarks and that its scores correlate positively with downstream T2V quality, as verified in a training-free setting with multiple off-the-shelf T2V models. The result is a practical benchmark that can guide caption improvements to enhance T2V generation.

Abstract

The training of controllable text-to-video (T2V) models relies heavily on the alignment between videos and captions, yet little existing research connects video caption evaluation with T2V generation assessment. This paper introduces VidCapBench, a video caption evaluation scheme specifically designed for T2V generation, agnostic to any particular caption format. VidCapBench employs a data annotation pipeline, combining expert model labeling and human refinement, to associate each collected video with key information spanning video aesthetics, content, motion, and physical laws. VidCapBench then partitions these key information attributes into automatically assessable and manually assessable subsets, catering to both the rapid evaluation needs of agile development and the accuracy requirements of thorough validation. By evaluating numerous state-of-the-art captioning models, we demonstrate the superior stability and comprehensiveness of VidCapBench compared to existing video captioning evaluation approaches. Verification with off-the-shelf T2V models reveals a significant positive correlation between scores on VidCapBench and the T2V quality evaluation metrics, indicating that VidCapBench can provide valuable guidance for training T2V models. The project is available at https://github.com/VidCapBench/VidCapBench.
