Disentangling Instruction Influence in Diffusion Transformers for Parallel Multi-Instruction-Guided Image Editing

Hui Liu, Bin Zou, Suiyun Zhang, Kecheng Chen, Rui Liu, Haoliang Li

TL;DR

This work proposes Instruction Influence Disentanglement (IID), a novel framework for DiT-based models that executes multiple instructions in parallel within a single denoising process. It analyzes self-attention mechanisms in DiTs and derives instruction-specific attention masks to disentangle each instruction's influence.

Abstract

Instruction-guided image editing enables users to specify modifications using natural language, offering more flexibility and control. Among existing frameworks, Diffusion Transformers (DiTs) outperform U-Net-based diffusion models in scalability and performance. However, real-world scenarios often require executing multiple instructions concurrently, and both common workarounds fall short: step-by-step editing suffers from accumulated errors and degraded quality, while merging multiple instructions into a single prompt usually results in incomplete edits due to instruction conflicts. We propose Instruction Influence Disentanglement (IID), a novel framework designed for DiT-based models that enables parallel execution of multiple instructions in a single denoising process. By analyzing self-attention mechanisms in DiTs, we identify distinctive attention patterns in multi-instruction settings and derive instruction-specific attention masks to disentangle each instruction's influence. These masks guide the editing process to ensure localized modifications while preserving consistency in non-edited regions. Extensive experiments on open-source and custom datasets demonstrate that IID reduces diffusion steps while improving fidelity and instruction completion compared to existing baselines. The code will be publicly released upon acceptance of the paper.
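
To make the idea of instruction-specific attention masks more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function names `instruction_mask` and `blend_tokens`, the min-max normalization, and the fixed `threshold` are all assumptions chosen for illustration. The sketch only shows how an instruction-to-image attention map could be turned into a binary token mask, and how such masks could restrict each instruction's edit to its own tokens while leaving the remaining tokens unchanged.

```python
import numpy as np


def instruction_mask(attn_zp, threshold=0.5):
    """Turn an attention map into a boolean mask over image tokens.

    attn_zp: array of shape (num_image_tokens, num_instruction_tokens)
             holding attention weights between image tokens and the
             tokens of one instruction.
    threshold: hypothetical cut-off on the min-max normalized score
               (an assumption, not a value from the paper).
    """
    # Average the attention each image token pays to the instruction tokens.
    score = attn_zp.mean(axis=1)
    lo, hi = score.min(), score.max()
    if hi - lo < 1e-12:
        # A flat map carries no localization signal: select nothing.
        return np.zeros(score.shape, dtype=bool)
    normalized = (score - lo) / (hi - lo)
    return normalized >= threshold


def blend_tokens(original, edited_per_instruction, masks):
    """Copy edited tokens into the original sequence, one mask per instruction.

    Tokens not covered by any mask keep their original values.
    """
    out = original.copy()
    for edited, mask in zip(edited_per_instruction, masks):
        out[mask] = edited[mask]
    return out


# Toy example: 3 image tokens, 2 instruction tokens, token dimension 2.
attn = np.array([[0.9, 0.8], [0.1, 0.2], [0.5, 0.7]])
mask = instruction_mask(attn)
original = np.zeros((3, 2))
edited = np.ones((3, 2))
result = blend_tokens(original, [edited], [mask])
```

In this toy case the first and third image tokens attend strongly to the instruction, so only those tokens are replaced by their edited versions. A real pipeline would extract the attention weights from the model at a chosen denoising step and would likely need a more careful mask rule than a single global threshold.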

Paper Structure

This paper contains 20 sections, 8 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Comparison of our proposed Instruction Influence Disentanglement (IID) framework with step-by-step editing and compositing all instructions into a single one for multi-instruction image editing.
  • Figure 2: Illustration of our proposed IID framework. $T$ denotes the total number of diffusion steps, while $S$ represents the pre-defined step for mask generation and multi-instruction influence disentanglement. $\bar{z}_{S}[M_i]$ corresponds to the token sequence of the noised image $\bar{z}_{S}$ associated with the mask $M_i$ for the instruction $P_i$ (ideally representing the tokens pertinent to the editing area specified by $P_i$).
  • Figure 3: Visualization of the attention maps between the instruction tokens and the noised image tokens ($\bar{A}_{ZP}$) and among the noised image tokens ($\bar{A}_{ZZ}$). Attention weights are extracted from the penultimate layer. "Avg" denotes the attention map averaged across all heads.
  • Figure 4: Qualitative comparisons. The top two rows of images are based on FluxEdit, while the bottom two rows are based on Omnigen. "Single" results are obtained by editing the input image with one instruction only.
  • Figure 5: Ablation study on the influence of timesteps on the attention maps of the penultimate layer of both models. For $A_{ZP}$ of Omnigen, we display the attention head with the highest activation in the editing region.
  • ...and 3 more figures