Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models

Tianyi Lu, Xing Zhang, Jiaxi Gu, Renjing Pei, Songcen Xu, Xingjun Ma, Hang Xu, Zuxuan Wu

TL;DR

This paper is the first to reveal that T2I and T2V LDMs can complement each other in terms of structure and temporal consistency, ultimately generating high-quality videos.

Abstract

Latent Diffusion Models (LDMs) are renowned for their powerful capabilities in image and video synthesis. Yet, compared to text-to-image (T2I) editing, text-to-video (T2V) editing suffers from weaker temporal consistency and structure preservation, owing to insufficient pre-training data, limited model editability, or extensive tuning costs. To address this gap, we propose FLDM (Fused Latent Diffusion Model), a training-free framework that achieves high-quality T2V editing by integrating various T2I and T2V LDMs. Specifically, FLDM uses a hyper-parameter with an update schedule to fuse image and video latents during the denoising process. This paper is the first to reveal that T2I and T2V LDMs can complement each other in terms of structure and temporal consistency, ultimately generating high-quality videos. FLDM can also serve as a versatile plugin, applicable to off-the-shelf image and video LDMs, to significantly enhance the quality of video editing. Extensive quantitative and qualitative experiments on popular T2I and T2V LDMs demonstrate that FLDM achieves better editing quality than state-of-the-art T2V editing methods. Our project code is available at https://github.com/lutianyi0603/fuse_your_latents.

Paper Structure (12 sections, 5 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: FLDM can serve as a versatile plugin that can be applied to off-the-shelf image diffusion models (e.g., InstructPix2Pix [brooks2023instructpix2pix] and ControlNet [zhang2023adding]) and video diffusion models (e.g., VidRD [gu2023reuse]).
  • Figure 2: Denoising process of T2I and T2V LDMs (+ Van Gogh style). First column: without FLDM, T2I LDMs preserve structure well but lack temporal consistency, whereas T2V LDMs achieve good temporal consistency but preserve structure poorly. Last column: with multi-source latent fusion (FLDM), both the structure and the temporal consistency of the edited videos are enhanced. Best viewed on the project homepage.
  • Figure 3: FLDM framework for T2V editing. During inference, the input video is encoded by the VAE encoder into a clean latent $z_0\in \mathbb{R}^{f\times c\times h\times w}$, which is then inverted to a noisy latent $z_T\in \mathbb{R}^{f\times c\times h\times w}$ through DDIM inversion. During the first $\tau$ timesteps, the T2V LDM and the T2I LDM each predict noise for the noisy latent separately. In the remaining $T-\tau$ timesteps, a multi-source latent fusion module fuses the denoised latents from the T2V and T2I LDMs.
  • Figure 4: Qualitative comparison with SOTA approaches. FLDM has the best textual alignment, temporal consistency, and fidelity.
  • Figure 5: Object, background, and style editing results of FLDM. In comparison with T2I models only, FLDM with VidRD generates more consistent frames. Compared to videos edited by VidRD only, FLDM is superior in structure and fidelity.
  • ...and 5 more figures
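
The two-phase procedure described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. It makes three assumptions that the text above does not state: the fusion is a convex combination weighted by a hyper-parameter `alpha`, the update schedule is linear, and the fused latent is fed back into both branches. The names `fuse_latents`, `alpha_schedule`, `edit_video_latent`, `denoise_video`, and `denoise_image` are hypothetical, and the two denoisers are stand-ins for one denoising step of a real T2V or T2I LDM.

```python
import numpy as np


def fuse_latents(z_video, z_image, alpha):
    """Assumed fusion rule: convex combination of video and image latents."""
    return alpha * z_video + (1.0 - alpha) * z_image


def alpha_schedule(step, num_steps, alpha_start=0.5, alpha_end=1.0):
    """Hypothetical linear update schedule for the fusion weight."""
    frac = step / max(num_steps - 1, 1)
    return alpha_start + (alpha_end - alpha_start) * frac


def edit_video_latent(z_T, denoise_video, denoise_image, T, tau):
    """Run T denoising steps; fuse the two branches after the first tau steps.

    z_T has shape (f, c, h, w): frames, channels, height, width.
    denoise_video / denoise_image are placeholders for one denoising step
    of a T2V and a T2I LDM; each takes (latent, step_index).
    """
    z_video = z_T.copy()
    z_image = z_T.copy()
    for i in range(T):
        # Each model denoises its own copy of the latent.
        z_video = denoise_video(z_video, i)
        z_image = denoise_image(z_image, i)
        if i >= tau:
            # Fusion phase: mix the denoised latents and share the result
            # (sharing with both branches is an assumption of this sketch).
            alpha = alpha_schedule(i - tau, T - tau)
            fused = fuse_latents(z_video, z_image, alpha)
            z_video, z_image = fused, fused.copy()
    return z_video
```

A toy call such as `edit_video_latent(np.ones((2, 4, 8, 8)), lambda z, i: 0.9 * z, lambda z, i: 0.8 * z, T=10, tau=4)` exercises both phases and returns a latent with the same shape as the input. In a real pipeline the result would still need to be decoded by the VAE decoder.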