Decoupled Video Generation with Chain of Training-free Diffusion Model Experts

Wenhao Li, Yichao Cao, Xiu Su, Xi Lin, Shan You, Mingkai Zheng, Yi Chen, Chang Xu

TL;DR

ConFiner addresses the high computational cost and weak coherence of video diffusion models by decoupling video generation into three subtasks, each handled by a specialized diffusion expert: structure control, spatial refinement, and temporal refinement. In the two-stage pipeline, a control expert first establishes the coarse spatio-temporal structure; a refinement stage then applies coordinated denoising between the spatial and temporal experts, yielding high-quality results with significantly fewer inference steps. The extended ConFiner-Long framework generates long, coherent videos of up to 600 frames through consistency initialization, coherence guidance, and staggered refinement across segments. Empirically, ConFiner achieves strong objective and subjective results at about 10% of the inference cost of baselines, which suggests practical feasibility for filmmaking and animation workflows.

Abstract

Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models incur high computational costs and produce suboptimal results because of the extreme complexity of the video generation task. In this paper, we propose ConFiner, an efficient video generation framework that decouples video generation into easier subtasks: structure control and spatial-temporal refinement (the "Con" and "Fine" in ConFiner). It generates high-quality videos with a chain of off-the-shelf diffusion model experts, each responsible for one decoupled subtask. During refinement, we introduce coordinated denoising, which merges the capabilities of multiple diffusion experts into a single sampling process. Furthermore, we design the ConFiner-Long framework, which generates long, coherent videos by applying three constraint strategies to ConFiner. Experimental results indicate that, with only 10% of the inference cost, ConFiner surpasses representative models such as LaVie and ModelScope across all objective and subjective metrics, and ConFiner-Long can generate high-quality, coherent videos of up to 600 frames.

Paper Structure

This paper contains 18 sections, 13 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: (a) Conventional video generation process. (b) Motivation of the proposed ConFiner.
  • Figure 2: Comparison between our ConFiner-Long and StreamingT2V. ConFiner-Long exhibits better consistency and imaging quality.
  • Figure 3: Comparison of our ConFiner-Long with FreeNoise. ConFiner-Long achieves much better imaging clarity and quality.
  • Figure 4: Pipeline of ConFiner and ConFiner-Long. ConFiner decouples the video generation process. First, the control expert generates a video structure. Then the temporal and spatial experts refine the spatio-temporal details, working together through coordinated denoising. By adding consistency initialization, coherence guidance, and staggered refinement to ConFiner, ConFiner-Long can generate coherent long videos.
  • Figure 5: Ablation study on the three strategies of ConFiner-Long. The three strategies work together to achieve coherence between segments.
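
The two-stage flow described in the Figure 4 caption can be illustrated with a toy sketch. This is not the paper's implementation: the experts here are simple smoothing functions rather than diffusion models, and every name (control_expert, spatial_expert, temporal_expert, confiner_sketch) is hypothetical. The strict alternation of the two experts inside one loop is also an assumption made for illustration; the source only states that coordinated denoising merges both experts into a single sampling.

```python
import random
from typing import List

Video = List[List[float]]  # toy "latent video": frames x values per frame


def control_expert(num_frames: int, frame_size: int) -> Video:
    """Hypothetical control expert: emits a coarse structure (a ramp over time)."""
    return [[float(t)] * frame_size for t in range(num_frames)]


def add_noise(video: Video, scale: float, seed: int) -> Video:
    """Perturb the coarse structure so the refinement stage has something to clean up."""
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, scale) for v in frame] for frame in video]


def spatial_expert(video: Video) -> Video:
    """Hypothetical spatial expert: pulls each value halfway toward its frame mean."""
    refined = []
    for frame in video:
        mean = sum(frame) / len(frame)
        refined.append([0.5 * (v + mean) for v in frame])
    return refined


def temporal_expert(video: Video) -> Video:
    """Hypothetical temporal expert: blends each frame with its neighbours in time."""
    last = len(video) - 1
    refined = []
    for t, frame in enumerate(video):
        prev_frame = video[max(t - 1, 0)]      # clamp at the first frame
        next_frame = video[min(t + 1, last)]   # clamp at the last frame
        refined.append([
            0.5 * cur + 0.25 * (prev + nxt)
            for cur, prev, nxt in zip(frame, prev_frame, next_frame)
        ])
    return refined


def confiner_sketch(num_frames: int, frame_size: int,
                    steps: int = 4, seed: int = 0) -> Video:
    # Stage 1: the control expert produces the coarse video structure.
    structure = control_expert(num_frames, frame_size)
    # Stage 2: refinement in a single loop shared by both experts.
    # Alternating them step by step is an illustrative assumption.
    video = add_noise(structure, scale=0.5, seed=seed)
    for step in range(steps):
        expert = spatial_expert if step % 2 == 0 else temporal_expert
        video = expert(video)
    return video


video = confiner_sketch(num_frames=4, frame_size=3)
print(len(video), len(video[0]))  # frame count and values per frame
```

The point of the sketch is only the control flow: one expert fixes the structure, and the refinement experts share a single loop instead of each running a full sampling pass of their own.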