VC4VG: Optimizing Video Captions for Text-to-Video Generation

Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng, Tiezheng Ge, Bo Zheng, Qin Jin

TL;DR

VC4VG introduces a dimension-aware captioning framework for text-to-video generation, decomposing captions into five essential dimensions to better support video reconstruction. It pairs a specialized captioner, LLaVA-Video-Gen, with VC4VG-Bench, a generation-focused benchmark for robust automatic and human evaluation. The work demonstrates a strong link between richer, necessity-aligned captions and improved T2V performance through closed-loop fine-tuning and multi-dataset experiments. The approach offers practical guidance for scalable caption generation and benchmarking, with publicly released tools to accelerate research in high-quality T2V training data creation.

Abstract

Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/alimama-creative/VC4VG to support further research.

Paper Structure

This paper contains 32 sections, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Overview of the video caption optimization framework for text-to-video (T2V) generation. The original video is transformed into textual descriptions via captioners. These captions are then optimized along the dimensions that we consider essential for video reconstruction, guided by VC4VG-Bench evaluation. Finally, the optimized captions are used to train T2V models and generate videos.
  • Figure 2: The core framework of evaluation QA-pairs, structured around five key assessment dimensions. Using dual references (video content and textual captions) enables multimodal alignment verification, which assists human annotators in ensuring the accuracy and comprehensive coverage of the evaluation QA-pairs.
  • Figure 3: Illustration of the multi-granularity evaluation QA-pair system designed for video generation tasks. With moderate information clustering in temporal processing, the hierarchical QA-pair architecture, organized by reconstruction necessity, incorporates multiple scoring points to comprehensively assess caption quality for video generation.
  • Figure 4: Scoring metrics are separated into (1) presence of arm movements and (2) movement specificity, to isolate the evaluation of complex information. In addition, character-specific features (e.g., wearing a hat, wearing a green jacket) are used to formulate diverse reference answers, which improves answer adaptability across diverse captions.
  • Figure 5: Representative examples of video caption performance on the benchmark, showing variations in action descriptions.
  • ...and 5 more figures
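
The separation of scoring points described in the Figure 4 caption can be illustrated with a small sketch. It is not the benchmark's implementation: the `ScoringPoint` class, the `score_caption` function, the numeric necessity weights, and the substring matching (a simple stand-in for whatever judge the benchmark actually uses) are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ScoringPoint:
    """One independently scored piece of information (hypothetical structure)."""
    name: str
    reference_answers: list[str]  # any one match counts as covered
    weight: float                 # hypothetical necessity weight


def score_caption(caption: str, points: list[ScoringPoint]) -> float:
    """Return the weighted fraction of scoring points the caption covers.

    Matching is plain case-insensitive substring search, used here only
    as a placeholder for a real answer-checking step.
    """
    total = sum(p.weight for p in points)
    if total == 0:
        return 0.0
    text = caption.lower()
    earned = sum(
        p.weight
        for p in points
        if any(ans.lower() in text for ans in p.reference_answers)
    )
    return earned / total


# Two scoring points mirroring the Figure 4 split; the reference answers
# and weights are invented examples, not benchmark data.
points = [
    ScoringPoint("arm movement present", ["waves", "raises"], weight=2.0),
    ScoringPoint(
        "movement specificity",
        ["raises his right arm above his head"],
        weight=1.0,
    ),
]

coarse = score_caption("A man wearing a hat waves.", points)
detailed = score_caption(
    "The man in the green jacket raises his right arm above his head.",
    points,
)
missing = score_caption("A man stands still.", points)
```

Scoring the two points separately lets a caption that mentions the movement but not its details earn partial credit, instead of failing a single combined question.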