Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Simon Shaolei Du, Yelong Shen

TL;DR

The paper introduces StoryEval, a story-centric benchmark for future long video generation. It addresses a gap in existing metrics, which do not measure whether a video coherently presents a story across consecutive events. The benchmark comprises a 423-prompt suite spanning 7 classes and evaluates 11 text-to-video models (3 closed-source, 8 open-source) using two vision-language verifiers (GPT-4o and LLaVA-OV-Chat-72B) in a two-step querying workflow with a unanimous voting scheme. Results show that no model exceeds roughly 50% average completion, with Creative prompts proving especially difficult. This points to a substantial challenge in story-level coherence for long videos and to StoryEval's value as a complement to traditional detail-oriented benchmarks. Human alignment analyses validate the automated scoring and support StoryEval's reliability, suggesting that future work should prioritize narrative continuity and event-level reasoning for long-form video generation.

Abstract

The current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which is foreseeably an essential capability for future long video generation scenarios. For example, top T2V generative models still fail to generate a video of the short, simple story 'how to put an elephant into a refrigerator.' While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing a short story composed of 2-4 consecutive events. We employ advanced vision-language models, such as GPT-4V and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. Our methods ensure high alignment with human evaluations, and the evaluation of 11 models shows how challenging the benchmark is, with none exceeding an average story-completion rate of 50%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent story-driven video generation.

Paper Structure

This paper contains 30 sections, 2 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Even top video generative models fail to fully present some short stories, such as "using 3 steps to place an elephant into the refrigerator". We evaluate some top commercial models (Kling1.5 (10s), Hailuo (6s), and Pika1.5 (6s)) on our StoryEval, whose prompts contain short stories composed of several consecutive events. "completion list" denotes whether each event is completed in the generated video (0/1: not completed/completed), and "completion rate" is the average of that list. For example, in the first prompt, all three models show the man dribbling the basketball, but none shows him throwing the ball, so completion list = [1,0] and completion rate = 50%.
  • Figure 2: StoryEval evaluation on 11 text-to-video generative models. We visualize their completion rates for the stories across 7 classes, along with the average result for the entire set. Even the best model achieves an average completion rate of less than 50%, meaning it can successfully present fewer than half of the events in a simple short story on average. Detailed results are in Table \ref{tab:all result}.
  • Figure 3: Pipeline of StoryEval evaluation. We carefully design 423 prompts across 7 classes, and each prompt describes a short story containing 2-4 sequential events, as in Figure \ref{fig:show}. For evaluation, we choose 3 top closed-source commercial models and 8 well-known open-source models, use them for text-to-video generation, and then combine the generated videos and the original prompts as input for the VLM verifiers. Unlike previous detail-oriented evaluations that focus on fine-grained quality features, we let the VLM judge how many events are successfully presented in the generated videos, and thus obtain the completion rate of the story in the prompt. Many top models perform well on previous evaluations, but none of them exceeds a 50% completion rate on StoryEval.
  • Figure 4: Prompt Suite Statistics. (Left) Word cloud of the StoryEval prompt suite (excluding human-related words like "person" to show more diverse terms). (Right) We visualize the proportion of the 7 classes in the prompt suite using an UpSet plot. The bottom left of the figure shows the number of prompts in each class, while the right side displays the number of prompts belonging to each class-intersection group. For example, there are 14 prompts that belong to exactly the three classes "Hard", "Creative", and "Object".
  • Figure 5: The process of constructing the StoryEval prompt suite. Video examples are selected from three closed-source models: Kling-1.5 [kling2024], Hailuo [hailuo2024], and Pika-1.5 [pika2024].
  • ...and 2 more figures
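
The completion list and completion rate described in the Figure 1 caption, together with the unanimous voting mentioned in the abstract, can be illustrated with a minimal Python sketch. It assumes that "unanimous" means an event counts as completed only when every verifier marks it as completed; the function names `unanimous_vote` and `completion_rate` and the sample verdicts are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence


def unanimous_vote(verdicts: Sequence[Sequence[int]]) -> List[int]:
    """Merge per-verifier completion lists into one completion list.

    Assumption (not confirmed by the source): an event is marked 1 only
    if every verifier marked it 1; otherwise it is marked 0.
    """
    if not verdicts:
        raise ValueError("at least one verifier verdict is required")
    n_events = len(verdicts[0])
    if any(len(v) != n_events for v in verdicts):
        raise ValueError("all verifiers must judge the same number of events")
    return [int(all(v[i] == 1 for v in verdicts)) for i in range(n_events)]


def completion_rate(completion_list: Sequence[int]) -> float:
    """Average of the 0/1 completion list, as in the Figure 1 caption."""
    if not completion_list:
        raise ValueError("completion list must not be empty")
    return sum(completion_list) / len(completion_list)


# Hypothetical verdicts for the two-event basketball story in Figure 1:
# both verifiers see the dribbling (1) but not the throw (0).
verifier_a = [1, 0]
verifier_b = [1, 0]

merged = unanimous_vote([verifier_a, verifier_b])
print(merged, completion_rate(merged))  # [1, 0] 0.5
```

Under this assumption, any disagreement between verifiers resolves to "not completed", which makes the reported completion rate a conservative estimate.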