InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

Mohamed Elmoghany; Liangbing Zhao; Xiaoqian Shen; Subhojyoti Mukherjee; Yang Zhou; Gang Wu; Viet Dac Lai; Seunghyun Yoon; Ryan Rossi; Abdullah Rashwan; Puneet Mathur; Varun Manjunatha; Daksh Dangi; Chien Nguyen; Nedim Lipka; Trung Bui; Krishna Kumar Singh; Ruiyi Zhang; Xiaolei Huang; Jaemin Cho; Yu Wang; Namyong Park; Zhengzhong Tu; Hongjie Chen; Hoda Eldardiry; Nesreen Ahmed; Thien Nguyen; Dinesh Manocha; Mohamed Elhoseiny; Franck Dernoncourt

TL;DR

InfinityStory introduces a background-consistent generation pipeline that maintains visual coherence across scenes while preserving character identity and spatial relationships, together with a transition-aware video synthesis module that generates smooth shot transitions in complex scenarios where multiple subjects enter or exit the frame.

Abstract

Generating long-form storytelling videos with consistent visual narratives remains a significant challenge in video synthesis. We present a novel framework, a dataset, and a model that address three critical limitations: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. Our approach introduces a background-consistent generation pipeline that maintains visual coherence across scenes while preserving character identity and spatial relationships. We further propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames, going beyond the single-subject limitations of prior work. To support this, we contribute a synthetic dataset of 10,000 multi-subject transition sequences covering underrepresented dynamic scene compositions. On VBench, InfinityStory achieves the highest Background Consistency (88.94), the highest Subject Consistency (82.11), and the best overall average rank (2.80), showing improved stability, smoother transitions, and better temporal coherence.

Paper Structure (20 sections, 8 equations, 13 figures, 3 tables)

Introduction
Related works
Methods
Problem Formulation
Multi-Agent Narrative for Long-Video Planning
Enhancing Background Consistency via Location Injection
Towards Smooth Multi-Character Transition
Synthetic Data Creation & Filtering
Experiments
Experiment Setup
Main Results
Ablation Studies
Conclusions
Experiments
Human Studies
...and 5 more sections

Figures (13)

Figure 1: Our pipeline takes story text and the characters' reference images as input and outputs a long video with consistent backgrounds and smooth transitions. The figure shows a multi-shot scene generated with a consistent background, and a smooth shot-to-shot transition within one scene in which multiple characters do not appear or disappear abruptly. Our model is the first of its kind in multi-subject smooth transitions.
Figure 2: Overview of the proposed storytelling video generation pipeline. Green shapes: outputs of the agentic pipeline. Purple shapes: narrative (odd) shots, where keyframe images are generated and then animated into video shots using I2V. Red shapes: transition (even) shots, which take the last frame of the preceding generated I2V shot and the next keyframe to generate a First-Last-Frame-to-Video (FLF2V) clip that smoothly bridges consecutive narrative shots. The output shots are stitched together into one coherent video, i.e., shot-1 (I2V) $\to$ shot-2 (FLF2V) $\to$ shot-3 (I2V) $\to$ shot-4 (FLF2V) $\to$ ..., and so on.
Figure 3: Framework for creating a large dataset of smooth transitions to train the First-Last-Frame-to-Video (FLF2V) model. We use 4 agentic pipelines to generate the prompt that generates the transition video. We then use a VLM to filter out bad videos that do not have the correct number of characters, and to generate another prompt for the transition video. We train the FLF2V model on the combination of prompts from the video prompter and the VLM, together with the generated video and its first and last frames.
Figure 4: Our pipeline results in smooth transitions
Figure 5: Website we developed to collect human evaluations, with Video A, B, and C alternating between the three methods.
...and 8 more figures
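
The alternating shot schedule described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration of the ordering only, not the authors' implementation: the `Shot` dataclass, the `plan_shots` function, and the string frame labels are hypothetical names introduced here, and no actual I2V or FLF2V model is called.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shot:
    """Hypothetical record describing one shot in the stitched video."""
    index: int                        # 1-based position in the final video
    mode: str                         # "I2V" (narrative) or "FLF2V" (transition)
    first_frame: str                  # label of the conditioning first frame
    last_frame: Optional[str] = None  # only set for FLF2V transitions


def plan_shots(keyframes: List[str]) -> List[Shot]:
    """Interleave narrative (odd) and transition (even) shots."""
    shots: List[Shot] = []
    for k, keyframe in enumerate(keyframes):
        # Narrative shot: the keyframe image is animated with I2V.
        shots.append(Shot(index=2 * k + 1, mode="I2V", first_frame=keyframe))
        if k + 1 < len(keyframes):
            # Transition shot: bridges the last frame of the I2V shot just
            # planned to the next narrative keyframe.
            shots.append(Shot(
                index=2 * k + 2,
                mode="FLF2V",
                first_frame=f"last_frame_of_shot_{2 * k + 1}",
                last_frame=keyframes[k + 1],
            ))
    return shots


plan = plan_shots(["keyframe_1", "keyframe_2", "keyframe_3"])
print(" -> ".join(f"shot-{s.index} ({s.mode})" for s in plan))
# prints: shot-1 (I2V) -> shot-2 (FLF2V) -> shot-3 (I2V) -> shot-4 (FLF2V) -> shot-5 (I2V)
```

With N narrative keyframes, this schedule yields N I2V shots and N - 1 FLF2V transitions, so every transition is conditioned on frames that already exist at both ends.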
