FlexiFilm: Long Video Generation with Flexible Conditions
Yichen Ouyang, Jianhao Yuan, Hao Zhao, Gaoang Wang, Bo Zhao
TL;DR
FlexiFilm addresses long video generation by adding a temporal conditioner and a resampling strategy to a latent diffusion framework. The temporal conditioner provides multi-modal, temporally aware guidance, and co-training aligns temporal cues between the conditioner and the diffusion backbone to improve inter-frame consistency. The resampling step mitigates the non-zero signal-to-noise ratio (SNR) issue that would otherwise degrade quality over multi-round inference. Key contributions are the temporal conditioner with a video projector, a temporally consistent co-training scheme, and a resampling strategy that enables recursive long-video generation, validated on a new FF-Drive1 dataset and across driving and natural domains. The approach yields longer, more coherent videos (over 30 seconds, with 200+ frames in experiments) and improves both visual quality and consistency over baselines.
Abstract
Generating long and consistent videos has emerged as a significant yet challenging problem. Most existing diffusion-based video generation models are derived from image generation models and perform well on short videos. However, their simple conditioning mechanism and sampling strategy, both originally designed for image generation, cause severe performance degradation when adapted to long video generation, resulting in prominent temporal inconsistency and overexposure. In this work, we introduce FlexiFilm, a new diffusion model tailored for long video generation. Our framework incorporates a temporal conditioner to establish a more consistent relationship between generation and multi-modal conditions, and a resampling strategy to tackle overexposure. Empirical results demonstrate that FlexiFilm generates long and consistent videos, each over 30 seconds in length, outperforming competitors in qualitative and quantitative analyses. Project page: https://y-ichen.github.io/FlexiFilm-Page/
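To make the "non-zero SNR" issue named in the TL;DR concrete, the sketch below computes the signal-to-noise ratio at the final training timestep for two widely used noise schedules. It is an illustration only: the schedule parameters are the common DDPM linear and Stable Diffusion "scaled linear" defaults, assumed here for the example, and the text above does not state which schedule FlexiFilm uses. The function name `terminal_snr` is hypothetical.

```python
import math


def terminal_snr(beta_start, beta_end, num_steps=1000, scaled=False):
    """Return alpha_bar_T / (1 - alpha_bar_T) for a discrete beta schedule.

    alpha_bar_T is the product of (1 - beta_t) over all timesteps, i.e. the
    fraction of signal variance that survives at the last training step.
    """
    log_alpha_bar = 0.0
    for i in range(num_steps):
        frac = i / (num_steps - 1)
        if scaled:
            # "Scaled linear": interpolate in sqrt(beta), then square.
            lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
            beta = (lo + frac * (hi - lo)) ** 2
        else:
            # Plain linear interpolation between beta_start and beta_end.
            beta = beta_start + frac * (beta_end - beta_start)
        # Accumulate in log space to avoid underflow in the long product.
        log_alpha_bar += math.log1p(-beta)
    alpha_bar = math.exp(log_alpha_bar)
    return alpha_bar / (1.0 - alpha_bar)


# Assumed schedule parameters (common defaults, not taken from the paper).
snr_linear = terminal_snr(1e-4, 0.02)
snr_scaled = terminal_snr(0.00085, 0.012, scaled=True)

print(f"linear schedule terminal SNR:        {snr_linear:.3e}")
print(f"scaled-linear schedule terminal SNR: {snr_scaled:.3e}")
```

Both values are small but strictly positive, so the noisiest latents seen in training still carry a trace of signal, whereas inference usually starts from pure Gaussian noise. When each round of a recursive, multi-round generation is conditioned on the output of the previous one, a mismatch of this kind can plausibly accumulate, which is the situation the resampling strategy is described as addressing.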
