FreeInv: Free Lunch for Improving DDIM Inversion

Yuxiang Bao, Huijie Liu, Xun Gao, Huan Fu, Guoliang Kang

TL;DR

FreeInv targets the trajectory deviation in DDIM inversion by introducing a transformation-based latent augmentation that approximates an ensemble of trajectories without the cost of running multiple branches. It formalizes a one-time Monte Carlo sampling per time-step, combined with transformation-based branching, to approximate multi-branch ensembles efficiently. The method is architecture-agnostic, compatible with both U-Net and DiT diffusion backbones, and improves reconstruction fidelity for images and videos while preserving editing capabilities and reducing the computational burden. Empirical results on PIE and DAVIS demonstrate competitive or superior performance with substantially lower time and memory requirements, making FreeInv well-suited for video inversion and editing workflows.

Abstract

The naive DDIM inversion process usually suffers from a trajectory deviation issue, i.e., the latent trajectory during reconstruction deviates from the one during inversion. To alleviate this issue, previous methods either learn to mitigate the deviation or design cumbersome compensation strategies to reduce the mismatch error, incurring substantial time and computation costs. In this work, we present a nearly free-lunch method (named FreeInv) to address the issue more effectively and efficiently. In FreeInv, we randomly transform the latent representation and keep the transformation the same between the corresponding inversion and reconstruction time-steps. It is motivated by the statistical observation that an ensemble of DDIM inversion processes over multiple trajectories yields a smaller trajectory mismatch error in expectation. Moreover, through theoretical analysis and empirical study, we show that FreeInv performs an efficient ensemble of multiple trajectories. FreeInv can be freely integrated into existing inversion-based image and video editing techniques. For inverting video sequences in particular, it brings more significant fidelity and efficiency improvements. Comprehensive quantitative and qualitative evaluation on the PIE benchmark and the DAVIS dataset shows that FreeInv remarkably outperforms conventional DDIM inversion and is competitive with previous state-of-the-art inversion methods, with superior computational efficiency.
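
The core idea of sharing one random transformation per time-step between inversion and reconstruction can be illustrated with a toy sketch. This is not the authors' implementation: the noise predictor `toy_eps`, the restriction to quarter-turn rotations, the alpha schedule, and the exact point at which the rotation is applied and undone are all assumptions made here for illustration.

```python
import numpy as np


def sample_rotation_schedule(num_steps, seed=0):
    """Draw one random quarter-turn count k_t in {0, 1, 2, 3} per time-step.

    Quarter turns are an assumption of this sketch; they keep the
    rotation exactly invertible on a square latent grid.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, 4, size=num_steps)


def rotate(latent, k):
    """Rotate a (C, H, W) latent by k * 90 degrees in the spatial plane."""
    return np.rot90(latent, k=int(k), axes=(-2, -1))


def toy_eps(latent, t):
    """Hypothetical stand-in for a diffusion model's noise prediction."""
    return np.tanh(latent) * (0.1 + 0.01 * t)


def ddim_step(z, eps, a_from, a_to):
    """Deterministic DDIM update between cumulative alphas a_from and a_to."""
    z0_pred = (z - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * z0_pred + np.sqrt(1.0 - a_to) * eps


def invert(z0, alphas, schedule):
    """Inversion: rotate with k_t, then take one DDIM step towards noise."""
    z = z0
    for t in range(len(alphas) - 1):
        z = rotate(z, schedule[t])
        z = ddim_step(z, toy_eps(z, t), alphas[t], alphas[t + 1])
    return z


def reconstruct(zT, alphas, schedule):
    """Reconstruction: reuse the same k_t at step t and undo the rotation."""
    z = zT
    for t in reversed(range(len(alphas) - 1)):
        # The noise estimate uses the current latent, not the one seen
        # during inversion; this gap is the trajectory mismatch.
        z = ddim_step(z, toy_eps(z, t), alphas[t + 1], alphas[t])
        z = rotate(z, -schedule[t])
    return z


num_steps = 10
alphas = np.linspace(0.999, 0.5, num_steps + 1)  # assumed toy schedule
schedule = sample_rotation_schedule(num_steps, seed=3)
latent = np.random.default_rng(0).standard_normal((4, 8, 8))

noisy = invert(latent, alphas, schedule)
recovered = reconstruct(noisy, alphas, schedule)
print("mean absolute reconstruction error:", np.abs(recovered - latent).mean())
```

The schedule is sampled once and passed to both loops, so step `t` of the reconstruction sees the same rotation as step `t` of the inversion. With a real model, `toy_eps` would be replaced by the network's noise prediction, and a non-square latent would need a transformation other than odd quarter turns, since those swap height and width.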

Paper Structure

This paper contains 21 sections, 11 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Illustration of different ways to mitigate the trajectory deviation in DDIM inversion. The small image above the networks denotes the input. (a) Null-text Inversion [mokady2022null] reduces the mismatch error by optimizing a null-text embedding. (b) PnP Inversion [ju2024pnp] saves the reconstruction error of each step in memory and compensates for it during the reconstruction or editing process. (c) FreeInv improves DDIM inversion by applying a random transformation (e.g., rotation) to the input latent, with negligible time or memory cost.
  • Figure 2: Detailed illustration of FreeInv. We use rotation $Rot(\cdot,\cdot)$ as an example of the transformation $f(\cdot)$. During both the inversion and reconstruction phases, we rotate the latent representation by the same angle $\psi_t$ at the $t$-th time-step, where $\psi_t$ is randomly sampled.
  • Figure 3: Qualitative comparison. We integrate FreeInv into PnP [pnpDiffusion2023], MasaCtrl [cao2023masactrl], and P2P [hertz2022prompt], respectively. We compare the reconstruction and editing results with and without FreeInv.
  • Figure 4: Visualization of the reconstructed images of different approaches with FLUX.
  • Figure 5: Qualitative comparison. We select Prompt-to-Prompt (P2P) as the baseline editing framework, and compare the editing results with different inversion approaches. The source and target prompts are provided below each row of the images.
  • ...and 8 more figures