Stable Flow: Vital Layers for Training-Free Image Editing

Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, Daniel Cohen-Or

TL;DR

This work tackles training-free image editing with Diffusion Transformer (DiT) models by automatically identifying a set of vital layers that are crucial for image formation. It introduces an attention-injection mechanism that uses these vital layers to achieve stable, prompt-consistent edits across a range of tasks, from non-rigid modifications to object addition. To extend editing to real images, it pairs inverse Euler ODE inversion with a latent nudging technique that improves reconstruction and keeps edits controlled. Qualitative and quantitative comparisons, together with a user study, demonstrate the approach's effectiveness and versatility, and the vital-layer analysis has potential implications for model pruning and distillation. The result is a training-free, layer-focused route to reliable image editing with DiT-based diffusion models.

Abstract

Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike the UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify "vital layers" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications. The project page is available at https://omriavrahami.com/stable-flow
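
To make "selective injection of attention features" more concrete, here is a minimal NumPy sketch. It assumes one simple form of injection, in which the edited branch's image-token keys and values are swapped for those of a reference branch, but only in a chosen subset of layers. The abstract does not spell out the exact mechanism, so the function names (`joint_attention`, `edit_forward`), the missing learned projections, and the single-head layout are all illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def joint_attention(txt, img, ref_img=None):
    # Attention over the concatenated [text; image] token sequence.
    # Assumption: "injection" means the image keys/values are taken
    # from the reference branch instead of the edited branch.
    q = np.concatenate([txt, img])
    kv_img = img if ref_img is None else ref_img
    kv = np.concatenate([txt, kv_img])
    return attention(q, kv, kv)

def edit_forward(txt, img, ref_imgs, vital_layers):
    # Toy stack of residual attention layers (hypothetical helper).
    # ref_imgs[l] holds the reference branch's image tokens at layer l;
    # they are injected only when l is in vital_layers.
    x = img
    for l, ref in enumerate(ref_imgs):
        inject = ref if l in vital_layers else None
        out = joint_attention(txt, x, inject)
        x = x + out[len(txt):]  # residual update of the image tokens only
    return x

rng = np.random.default_rng(0)
txt = rng.normal(size=(3, 8))                         # 3 text tokens
img = rng.normal(size=(5, 8))                         # 5 image tokens
refs = [rng.normal(size=(5, 8)) for _ in range(4)]    # 4 layers

plain = edit_forward(txt, img, refs, vital_layers=set())
edited = edit_forward(txt, img, refs, vital_layers={1, 3})
print(plain.shape, edited.shape)
```

With an empty `vital_layers` set the reference tokens are never read, so the edited branch runs unmodified; adding layer indices to the set pulls reference content into exactly those layers. Choosing which indices belong in that set is the problem the abstract's "vital layers" are meant to solve.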

Paper Structure

This paper contains 27 sections, 4 equations, 35 figures, 5 tables.

Figures (35)

  • Figure 1: Leveraging Reduced Diversity. Using the same initial seed with different editing prompts, diffusion models such as (1) SDXL generate diverse results (different identities of the dog and the cat), while (2) FLUX generates a more stable (less diverse) set of results out-of-the-box. However, there are still some unintended differences (the dog is standing in the leftmost column and sitting in the others, the color of the cat is changing, and the road is different on the right). Using our approach, (3) Stable Flow, the edits are stable, maintaining consistency of the unrelated content.
  • Figure 2: Layer Removal. (Left) Text-to-image DiT models consist of consecutive layers connected through residual connections [He2015DeepRL]. Each layer implements a multimodal diffusion transformer block [Esser2024ScalingRF] that processes a combined sequence of text and image embeddings. (Right) For each DiT layer, we perform an ablation by bypassing the layer using its residual connection. Then, we compare the generated result on the ablated model with the complete model using a perceptual similarity metric.
  • Figure 3: Layer Removal Quantitative Comparison. As explained in the section on layer importance, we measured the effect of removing each layer of the model by calculating the perceptual similarity between the generated images with and without this layer. Lower perceptual similarity indicates significant changes in the generated images (see Figure 4). As can be seen, removing certain layers significantly affects the generated images, while others have minimal impact. Importantly, influential layers are distributed across the transformer rather than concentrated in specific regions. Note that the first vital layers were omitted for clarity (as their perceptual similarity approached zero).
  • Figure 4: Layer Removal Qualitative Comparison. As explained in the section on layer importance, we illustrate the qualitative differences between vital and non-vital layers. While bypassing non-vital layers ($G_{5}$ and $G_{52}$) results in minor alterations, bypassing vital layers leads to significant changes: complete noise generation ($G_{0}$), global structure and identity changes ($G_{18}$), and alterations in texture and fine details ($G_{56}$).
  • Figure 5: Multi-Modal Attention Distribution. Given an input image of a man, we edit it to hold an avocado by injecting the reference image attention activations in the vital layers only (left), or in the non-vital layers (right), and visualize the multimodal attention of two points: a yellow point in a region that should remain unchanged (requiring copying from the reference image), and a red point in an area targeted for editing (requiring generation based on the text prompt). As can be seen, in vital layers (left), points meant to remain unchanged show dominant attention to visual features, while points targeted for editing exhibit stronger attention to relevant text tokens (e.g., "avocado"). Conversely, non-vital layers (right) show predominantly image-based attention even in regions marked for editing. This suggests that injecting features into vital layers strikes a good multimodal attention balance between preserving source content and incorporating text-guided modifications.
  • ...and 30 more figures
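
The layer-removal procedure described in the captions of Figures 2 and 3 can be sketched on a toy residual network. This is a sketch under stated assumptions and not the authors' code: the layers are small random maps standing in for DiT blocks, cosine similarity stands in for the perceptual similarity metric, and the names `run_model`, `make_layer`, and `vitality_ranking` are hypothetical.

```python
import numpy as np

def run_model(layers, x, skip=None):
    # Apply residual layers x <- x + f(x). Layer `skip` is bypassed,
    # so only its residual (identity) path contributes.
    for i, f in enumerate(layers):
        if i == skip:
            continue
        x = x + f(x)
    return x

def cosine_similarity(a, b):
    # Stand-in for a perceptual similarity metric (assumption).
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def make_layer(rng, dim, scale):
    # Toy layer: a random linear map followed by tanh, scaled by `scale`.
    W = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    return lambda x: scale * np.tanh(W @ x)

def vitality_ranking(layers, inputs, similarity=cosine_similarity):
    # For each layer, average the similarity between the full model's
    # output and the output with that layer bypassed, over all inputs.
    scores = []
    for l in range(len(layers)):
        sims = [
            similarity(run_model(layers, x), run_model(layers, x, skip=l))
            for x in inputs
        ]
        scores.append(float(np.mean(sims)))
    # Lowest similarity first: these layers change the output the most.
    order = sorted(range(len(layers)), key=lambda l: scores[l])
    return scores, order

rng = np.random.default_rng(0)
# One strongly contributing layer (index 1) among weak ones.
layers = [make_layer(rng, 16, s) for s in (0.01, 2.0, 0.01, 0.01)]
inputs = [rng.normal(size=16) for _ in range(8)]  # stand-ins for seeds/prompts

scores, order = vitality_ranking(layers, inputs)
print("similarity per bypassed layer:", np.round(scores, 4))
print("layers ranked most to least vital:", order)
```

In this toy setup, bypassing the large-scale layer lowers the similarity far more than bypassing any of the others, which mirrors the captions' point that low similarity after removal marks a layer as vital. In a real DiT the comparison would be made between generated images rather than raw vectors.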