FlexiFilm: Long Video Generation with Flexible Conditions

Yichen Ouyang, Jianhao Yuan, Hao Zhao, Gaoang Wang, Bo Zhao

TL;DR

FlexiFilm tackles long video generation by introducing a temporal conditioner and a resampling strategy within a latent diffusion framework. The temporal conditioner provides multi-modal, temporally aware guidance, and co-training aligns temporal cues between the conditioner and the diffusion backbone to improve inter-frame consistency. A resampling step mitigates the non-zero SNR problem during multi-round inference, which preserves quality. The key contributions are the temporal conditioner with a video projector, a temporally consistent co-training scheme, and a resampling strategy that enables recursive long-video generation, validated on a new FF-Drive1 dataset and across driving and natural domains. The approach yields longer, more coherent videos (over 30 seconds, with 200+ frames in experiments) and improves both visual quality and consistency over baselines, which points to its practical potential for realistic long-form video synthesis.

Abstract

Generating long and consistent videos has emerged as a significant yet challenging problem. While most existing diffusion-based video generation models, derived from image generation models, demonstrate promising performance in generating short videos, their simple conditioning mechanism and sampling strategy (originally designed for image generation) cause severe performance degradation when adapted to long video generation. This results in prominent temporal inconsistency and overexposure. In this work, we therefore introduce FlexiFilm, a new diffusion model tailored for long video generation. Our framework incorporates a temporal conditioner to establish a more consistent relationship between generation and multi-modal conditions, and a resampling strategy to tackle overexposure. Empirical results demonstrate that FlexiFilm generates long and consistent videos, each over 30 seconds in length, outperforming competitors in qualitative and quantitative analyses. Project page: https://y-ichen.github.io/FlexiFilm-Page/

Paper Structure

This paper contains 29 sections, 3 equations, 9 figures, and 1 table.

Figures (9)

  • Figure 1: Structure of the video projector. In the proposed video projector, the condition frames pass through IP samplers separately to obtain independent spatial information, and then go through temporal transformers together to learn inter-frame temporal information. The resulting projected feature therefore carries rich spatial and temporal information.
  • Figure 2: FlexiFilm pipeline. Subplot (a) shows that the backbone of FlexiFilm is a 3D U-Net operating in the latent space of a VAE, with a temporal conditioner that takes multi-modal references (text, image, or video) for video frame generation. Subplot (b) shows the workflow of the proposed temporal conditioner, where visual content (an image or frames) and text content are fused to guide the video generation process with both spatial and temporal information.
  • Figure 3: Temporal module co-training. All modules containing learnable parameters are drawn with a fire icon, and the two gradient backpropagation paths are marked with green arrows, occurring with probabilities $p$ and $1-p$ respectively.
  • Figure 4: Non-zero SNR problem. In the left subplot, the last columns of rows 2 and 4 are the model's denoised outputs from two noisy images at t=1000, but their mean distributions differ. In the right subplot, after two rounds of inference (using the last frame generated in round 1 as the condition for round 2), the generated frames become overexposed; our resampling strategy resolves this.
  • Figure 5: Comparison with baselines. Our model performs well in long video generation (200+ frames), while the baselines produce less consistent results and can generate only short clips (32 frames or fewer). In the video generated by SVD, the buildings unreasonably keep their original size as the vehicle moves forward, while the video generated by VideoCrafter has a different structure from the reference image and lacks inter-frame consistency.
  • ...and 4 more figures
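
The non-zero SNR problem named in Figure 4 can be illustrated numerically. The sketch below is not taken from the paper: it assumes a "scaled linear" beta schedule with 1000 steps and the values `beta_start=0.00085`, `beta_end=0.012`, which are common in Stable Diffusion-style latent diffusion models but are not confirmed as FlexiFilm's settings. The helper names are hypothetical. It shows only that the signal-to-noise ratio at the final timestep stays above zero, so the "pure noise" input at t=1000 still leaks some information about the original image. It does not implement the resampling strategy itself.

```python
import math

def scaled_linear_betas(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Betas spaced linearly in sqrt space (assumed schedule, not from the paper)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def terminal_snr(betas):
    """Return (SNR, signal scale) at the last step.

    alpha_bar_T is the product of (1 - beta_t) over all steps;
    SNR = alpha_bar_T / (1 - alpha_bar_T), signal scale = sqrt(alpha_bar_T).
    """
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
    return alpha_bar / (1.0 - alpha_bar), math.sqrt(alpha_bar)

snr_T, signal_scale_T = terminal_snr(scaled_linear_betas())
print(f"terminal SNR = {snr_T:.6f}, sqrt(alpha_bar_T) = {signal_scale_T:.4f}")
```

Under these assumed values the terminal SNR is small but strictly positive. A model trained this way never sees truly signal-free inputs, which is consistent with the caption's observation that outputs denoised from t=1000 can differ in their mean.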