BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, Limin Wang

TL;DR

BIVDiff tackles the high cost and temporal inconsistency of video diffusion by bridging, without any training, a task-specific image diffusion model (IDM) with a general text-to-video diffusion model (VDM). The method decouples synthesis into three stages: frame-wise generation with the IDM, Mixed Inversion to align latent distributions, and temporal smoothing with the VDM. Together these produce coherent videos across controllable generation, editing, inpainting, and outpainting. Mixed Inversion balances information from image and video inversions, enabling stable cross-model synthesis without per-video optimization. The reported results show strong temporal coherence and fidelity, broad generalization across tasks, and flexible model choice, making training-free video synthesis more practical for diverse applications.
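
As an illustration of what "balancing information from image and video inversions" could look like, the following is a minimal sketch that blends two inverted latents with a convex combination. The function name `mix_latents`, the parameter `gamma`, and the choice of which latent `gamma` weights are all assumptions made for this example; the summary above does not give the exact mixing rule.

```python
from typing import List


def mix_latents(image_latent: List[float],
                video_latent: List[float],
                gamma: float) -> List[float]:
    """Blend two inverted latents element-wise.

    Assumption for this sketch: `gamma` weights the video-model latent and
    (1 - gamma) weights the image-model latent. The real method may define
    the ratio differently.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if len(image_latent) != len(video_latent):
        raise ValueError("latents must have the same length")
    return [gamma * v + (1.0 - gamma) * i
            for i, v in zip(image_latent, video_latent)]


# Toy latents, flattened to plain lists for illustration only.
image_latent = [0.0, 2.0]
video_latent = [4.0, 6.0]
mixed = mix_latents(image_latent, video_latent, gamma=0.25)
```

Under this assumed convention, `gamma=0.0` keeps only the image-model latent and `gamma=1.0` keeps only the video-model latent, so the ratio acts as a single knob between the two inversions.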

Abstract

Diffusion models have made tremendous progress in text-driven image and video generation. Text-to-image foundation models are now widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks are less explored for several reasons. First, training a video generation foundation model requires huge memory and computation overhead. Even with video foundation models, additional costly training is still required for downstream video synthesis tasks. Second, although some works extend image diffusion models to videos in a training-free manner, temporal consistency cannot be well preserved. Finally, these adaptation methods are each designed for one task and fail to generalize to different tasks. To mitigate these issues, we propose a training-free general-purpose video synthesis framework, coined BIVDiff, via bridging specific image diffusion models and general text-to-video foundation diffusion models. Specifically, we first use a specific image diffusion model (e.g., ControlNet and Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion models (e.g., VidRD and ZeroScope) for temporal smoothing. This decoupled framework enables flexible image model selection for different purposes with strong task generalization and high efficiency. To validate the effectiveness and general use of BIVDiff, we perform a wide range of video synthesis tasks, including controllable video generation, video editing, video inpainting, and outpainting.
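
The three stages described in the abstract can be sketched as a simple composition of callables. This is only a structural illustration: `bivdiff_pipeline` and the three `toy_*` stand-ins are hypothetical names invented here, the "latents" are plain lists of floats, and the neighbour averaging merely mimics the idea of temporal smoothing rather than any real video diffusion model.

```python
from typing import Callable, List, Sequence

Frame = List[float]   # toy stand-in for one frame (or its latent)
Video = List[Frame]


def bivdiff_pipeline(conditions: Sequence[float],
                     prompt: str,
                     image_model: Callable[[float, str], Frame],
                     mixed_inversion: Callable[[Video, str], Video],
                     video_smoother: Callable[[Video, str], Video]) -> Video:
    """Chain the three decoupled stages named in the abstract."""
    # Stage 1: frame-wise generation with a task-specific image model.
    frames = [image_model(c, prompt) for c in conditions]
    # Stage 2: Mixed Inversion maps the generated video back to latents.
    latents = mixed_inversion(frames, prompt)
    # Stage 3: the video model turns the latents into a smoothed video.
    return video_smoother(latents, prompt)


# Hypothetical toy stand-ins so the sketch runs end to end.
def toy_image_model(condition: float, prompt: str) -> Frame:
    return [float(condition)]


def toy_mixed_inversion(frames: Video, prompt: str) -> Video:
    return [list(f) for f in frames]  # identity placeholder


def toy_video_smoother(latents: Video, prompt: str) -> Video:
    smoothed = []
    for t in range(len(latents)):
        window = latents[max(0, t - 1): t + 2]  # frame plus its neighbours
        smoothed.append([sum(f[k] for f in window) / len(window)
                         for k in range(len(latents[t]))])
    return smoothed


video = bivdiff_pipeline([0.0, 3.0, 0.0], "a toy prompt",
                         toy_image_model, toy_mixed_inversion,
                         toy_video_smoother)
```

Because each stage is passed in as a separate callable, swapping the image model for a different task leaves the other two stages untouched, which is the flexibility the decoupled design is meant to provide.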

Paper Structure

This paper contains 18 sections, 4 equations, 15 figures, and 1 table.

Figures (15)

  • Figure 1: Given an image diffusion model (IDM) for a specific image synthesis task, and a text-to-video diffusion foundation model (VDM), our model can perform training-free video synthesis by bridging IDM and VDM.
  • Figure 2: BIVDiff pipeline. Our framework consists of three components: Frame-wise Video Generation, Mixed Inversion, and Video Temporal Smoothing. We first use the image diffusion model to perform frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion model for video temporal smoothing.
  • Figure 3: Qualitative results of our proposed BIVDiff on the controllable video generation task, conditioned on depth maps, canny edges, and human pose sequences. We choose ControlNet [zhang2023adding] as our image diffusion model.
  • Figure 4: Qualitative results of our proposed BIVDiff on the video editing task. We select two popular image editing methods, Instruct Pix2Pix [brooks2023instructpix2pix] and Prompt2Prompt [DBLP:conf/iclr/HertzMTAPC23], as image models, and test a wide range of editing types.
  • Figure 5: Qualitative results of our proposed BIVDiff on the video inpainting and outpainting tasks. We adopt Stable Diffusion Inpainting [rombach2022high] as our image model. Our method can erase objects and complete the masked regions well.
  • ...and 10 more figures