HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads

Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Xiaoyu Kong, Jintao Li, Oliver Deussen, Tong-Yee Lee

TL;DR

HeadRouter is a training-free image editing framework that edits a source image by adaptively routing text guidance to different attention heads in MM-DiTs. A dual-token refinement module additionally refines text and image token representations for precise semantic guidance and accurate region expression.

Abstract

Diffusion Transformers (DiTs) have exhibited robust capabilities in image generation tasks. However, accurate text-guided image editing for multimodal DiTs (MM-DiTs) still poses a significant challenge. Unlike UNet-based structures, which can use self/cross-attention maps for semantic editing, MM-DiTs inherently lack support for explicitly and consistently incorporated text guidance, resulting in semantic misalignment between the edited results and the text. In this study, we reveal the sensitivity of different attention heads to different image semantics within MM-DiTs and introduce HeadRouter, a training-free image editing framework that edits the source image by adaptively routing the text guidance to different attention heads in MM-DiTs. Furthermore, we present a dual-token refinement module to refine text/image token representations for precise semantic guidance and accurate region expression. Experimental results on multiple benchmarks demonstrate HeadRouter's performance in terms of editing fidelity and image quality.

Paper Structure

This paper contains 23 sections, 17 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Analysis of multi-head attention in MM-DiTs. We illustrate the distribution of distinct semantics across attention heads. Dropping the most influential head leads to significant shifts in associated semantics while swapping the output features of this head enables targeted semantic injection to a certain degree.
  • Figure 2: Analysis of text guidance on image tokens. Key image regions influenced by text guidance are identified within the joint self-attention map and visualized. Additionally, we observe that text guidance influence diminishes as attention blocks progress in the denoising steps, leading to weakened semantic representation from text-guided editing.
  • Figure 3: Pipeline of our method. We introduce an instance-adaptive attention head router (IARouter) that adaptively activates attention heads based on their semantic sensitivity, enabling a more accurate representation of the specific image being edited.
  • Figure 4: Radar chart for evaluating image and prompt alignments in eight editing tasks. Overall, our approach effectively retains the intrinsic features of the original image while aligning precisely with the specified text guidance.
  • Figure 5: Qualitative comparison with baseline methods on various editing tasks. Our results demonstrate high alignment with the text guidance while keeping consistency with the reference image.
  • ...and 2 more figures
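
The head-dropping analysis described in the Figure 1 caption can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses random token vectors, omits learned query/key/value projections, and measures influence as the L2 change in the image-token outputs, all of which are assumptions made for illustration. The names `joint_attention` and `head_influence` are hypothetical.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(q, k, v):
    """Scaled dot-product attention for one head over lists of token vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(d)])
    return out

def joint_attention(tokens, num_heads, drop_head=None):
    """Attend per head over the joint (text + image) sequence and concatenate.

    Each token vector is split into num_heads slices. A dropped head
    contributes zeros, which mimics ablating its output features.
    """
    d_head = len(tokens[0]) // num_heads
    per_head = []
    for h in range(num_heads):
        sl = [t[h * d_head:(h + 1) * d_head] for t in tokens]
        if h == drop_head:
            per_head.append([[0.0] * d_head for _ in tokens])
        else:
            # Assumption: q = k = v (no learned projections) to keep the toy small.
            per_head.append(head_attention(sl, sl, sl))
    return [[x for h in range(num_heads) for x in per_head[h][i]]
            for i in range(len(tokens))]

def head_influence(tokens, num_heads, num_text):
    """L2 change in the image-token outputs when each head is dropped in turn."""
    base = joint_attention(tokens, num_heads)
    scores = []
    for h in range(num_heads):
        ablated = joint_attention(tokens, num_heads, drop_head=h)
        diff = sum((a - b) ** 2
                   for row_a, row_b in zip(base[num_text:], ablated[num_text:])
                   for a, b in zip(row_a, row_b))
        scores.append(math.sqrt(diff))
    return scores

random.seed(0)
NUM_TEXT, NUM_IMAGE, NUM_HEADS, D_MODEL = 3, 5, 4, 16
# The first NUM_TEXT rows stand in for text tokens, the rest for image tokens.
tokens = [[random.gauss(0, 1) for _ in range(D_MODEL)]
          for _ in range(NUM_TEXT + NUM_IMAGE)]
scores = head_influence(tokens, NUM_HEADS, NUM_TEXT)
most_influential = max(range(NUM_HEADS), key=lambda h: scores[h])
print("influence per head:", [round(s, 3) for s in scores])
print("most influential head:", most_influential)
```

In a real MM-DiT the same ablation would be applied to the per-head outputs of a joint attention block before the output projection, and the effect would be judged on the generated image rather than on a feature norm.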