LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation

Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang

TL;DR

LoCoT2V-Bench addresses the critical gap in evaluating long-form and complex text-to-video generation by coupling a realism-based prompt suite with a multi-dimensional scoring framework (SQ, TVA, TQ, CC, HERD). It constructs 240 prompts from real-world videos across 18 themes using self-refined LLM prompts and assesses nine open-source LVG models with comprehensive metrics, including event-level alignment and narrative quality. Results show that current models excel at static visual fidelity but struggle with long-range temporal coherence, inter-event consistency, and high-level thematic adherence, underscoring the need for improved long-form planning and alignment mechanisms. The benchmark thus provides a practical, robust platform for rigorous LVG evaluation and guides future work toward more coherent, controllable, and human-aligned long-form video generation.

Abstract

Text-to-video generation has recently made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge, especially when the prompts are complex. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on a variety of real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements such as scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics, such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD), which focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.

Paper Structure

This paper contains 43 sections, 16 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Overview of LoCoT2V-Bench. LoCoT2V-Bench comprehensively evaluates generated long videos along five dimensions: static quality, text-video alignment, temporal quality, content clarity, and Human Expectation Realization Degree (HERD). We obtain our prompts from collected real-world videos via MLLMs and leverage multiple tools to carry out the assessment.
  • Figure 2: Statistics of Collected Videos. The two images show statistics of our collected videos. Left: distribution of video quantity across themes. Right: duration distribution of the collected videos.
  • Figure 3: Visualization of correlation between static quality and other four dimensions. We display the results as four scatter plots and their regression lines. Note that "SQ" here refers to Static Quality.
  • Figure 4: Prompt Suite Statistics. The two graphs show statistics of our prompt suite. Left: a word cloud visualizing the word distribution of our prompts. Right: the prompt length distribution of our prompt suite, measured in number of words.
  • Figure 5: Length and complexity of prompts from different theme categories. Both quantities are normalized by fixed upper bounds: 500 for average length and 10 for complexity.
  • ...and 7 more figures
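
The normalization mentioned in the Figure 5 caption can be illustrated with a minimal sketch. The caption gives only the upper bounds (500 for average prompt length, 10 for complexity); the division by the bound, the clipping to [0, 1], and every name below are assumptions made for illustration, not the authors' implementation.

```python
# Upper bounds taken from the Figure 5 caption.
LENGTH_UPPER_BOUND = 500.0
COMPLEXITY_UPPER_BOUND = 10.0


def normalize(value: float, upper_bound: float) -> float:
    """Scale a raw value into [0, 1] by dividing by its upper bound.

    Clipping values outside [0, upper_bound] is an assumption; the caption
    does not say how out-of-range values are handled.
    """
    if upper_bound <= 0:
        raise ValueError("upper_bound must be positive")
    return max(0.0, min(value / upper_bound, 1.0))


def normalize_theme_stats(avg_length: float, complexity: float) -> tuple[float, float]:
    """Return (normalized average length, normalized complexity) for one theme."""
    return (
        normalize(avg_length, LENGTH_UPPER_BOUND),
        normalize(complexity, COMPLEXITY_UPPER_BOUND),
    )


if __name__ == "__main__":
    # Hypothetical theme with an average prompt length of 250 words
    # and a complexity score of 5.
    print(normalize_theme_stats(250, 5))
```

Putting both quantities on a shared [0, 1] scale is what allows length and complexity to be drawn on the same axis for each theme category.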