Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories

Chayan Jain, Rishant Sharma, Archit Garg, Ishan Bhanuka, Pratik Narang, Dhruv Kumar

TL;DR

The paper tackles the challenge of long-form video generation with consistent character identities by proposing a deterministic, multistage pipeline that mimics filmmaking. It leverages an LLM to produce a structured script blueprint, uses an asset-first approach for stable character visuals, and applies a temporal bridge to coherently connect scene clips into a final video with synchronized audio. Key contributions include an ablation-driven demonstration of the visual anchor's importance for identity retention, a detailed bias analysis showing Subject-World decoupling in Indian contexts, and a robust evaluation framework combining automated and ML-based judgments. The results show superior character consistency and prompt adherence over baselines, while highlighting the need for diverse training data to address cultural biases in future T2V systems.

Abstract

Generating long, cohesive video stories with consistent characters is a significant challenge for current text-to-video AI. We introduce a method that approaches video generation in a filmmaker-like manner. Instead of creating a video in one step, our proposed pipeline first uses a large language model to generate a detailed production script. This script guides a text-to-image model in creating consistent visuals for each character, which then serve as anchors for a video generation model to synthesize each scene individually. Our baseline comparisons validate the necessity of this multi-stage decomposition; specifically, we observe that removing the visual anchoring mechanism results in a catastrophic drop in character consistency scores (from 7.99 to 0.55), confirming that visual priors are essential for identity preservation. Furthermore, we analyze cultural disparities in current models, revealing distinct biases in subject consistency and dynamic degree between Indian- and Western-themed generations.

Paper Structure

This paper contains 31 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: The Proposed Pipeline. A high-level user prompt initiates an LLM to generate a detailed script and character descriptions. A Text-to-Image model creates visual assets for characters. For each scene, an Image-to-Image model generates an initial frame, which then guides an Image-to-Video model to synthesize the scene clip. All clips are then merged into the final output video.
  • Figure 2: Distribution of Evaluation Scores. Box plots summarizing the performance across all test samples.
  • Figure 3: The distribution of scores for Script Adherence versus Prompt Adherence across all generated clips.
  • Figure 4: While Subject Consistency remains comparable across demographics, World Consistency drops significantly for Indian videos, indicating contextual fragility.
  • Figure 5: Stratified Analysis. The stability gap is minimal in static scenes (Low Motion) but widens drastically in dynamic scenes (High Motion), confirming that the bias is motion-dependent.
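
The stage ordering described in the Figure 1 caption can be sketched as plain control flow. This is a minimal illustration and not the authors' implementation: every name below (Scene, Script, run_pipeline, demo, and the five stage callables) is hypothetical, and the stages are stand-in string functions rather than real LLM, text-to-image, or video models. The temporal bridge and audio synchronization mentioned in the TL;DR are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scene:
    description: str        # what happens in the scene
    characters: List[str]   # names of characters appearing in it


@dataclass
class Script:
    character_descriptions: Dict[str, str]  # name -> visual description
    scenes: List[Scene]


def run_pipeline(
    prompt: str,
    write_script: Callable[[str], Script],
    text_to_image: Callable[[str], str],
    image_to_image: Callable[[List[str], str], str],
    image_to_video: Callable[[str, str], str],
    merge: Callable[[List[str]], str],
) -> str:
    """Run the stages in the order given by the Figure 1 caption (all stages are stubs)."""
    # Stage 1: the LLM expands the user prompt into a script and character descriptions.
    script = write_script(prompt)

    # Stage 2 (asset-first): one reference image per character, created once and reused.
    assets = {
        name: text_to_image(desc)
        for name, desc in script.character_descriptions.items()
    }

    # Stage 3: per scene, build an initial frame from the character assets,
    # then let that frame guide the video model.
    clips = []
    for scene in script.scenes:
        anchors = [assets[name] for name in scene.characters]
        first_frame = image_to_image(anchors, scene.description)
        clips.append(image_to_video(first_frame, scene.description))

    # Stage 4: merge all scene clips into the final output.
    return merge(clips)


def demo() -> str:
    """Exercise the control flow with string-returning stand-ins for each model."""
    def write_script(prompt: str) -> Script:
        return Script(
            character_descriptions={"Asha": "a young pilot in a red jacket"},
            scenes=[
                Scene("Asha boards the plane", ["Asha"]),
                Scene("Asha lands at dusk", ["Asha"]),
            ],
        )

    return run_pipeline(
        "a story about a pilot",
        write_script=write_script,
        text_to_image=lambda desc: f"img({desc})",
        image_to_image=lambda anchors, text: f"frame({'+'.join(anchors)} | {text})",
        image_to_video=lambda frame, text: f"clip[{frame}]",
        merge=lambda clips: " >> ".join(clips),
    )
```

In this sketch the same character asset is passed into every scene's initial frame, which is the role the abstract assigns to the visual anchor. Dropping the `assets` step and calling the video stage on text alone would correspond to the ablated configuration the abstract reports on.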