Group Relative Attention Guidance for Image Editing

Xuanpu Zhang, Xuesong Niu, Ruidong Chen, Dan Song, Jianhao Zeng, Penghui Du, Haoxiang Cao, Kai Wu, An-an Liu

TL;DR

This work investigates Diffusion-in-Transformer image editors built on MM-DiT and finds that the query and key embeddings share a layer-specific bias, which the authors interpret as encoding the model's default editing behavior. It introduces Group Relative Attention Guidance (GRAG), a lightweight mechanism that reweights each token's deviation from this group bias to give continuous, fine-grained control over editing strength with minimal overhead and no tuning. GRAG can be integrated with as few as four lines of code, improves editing quality across multiple MM-DiT-based editors, and controls the degree of editing more smoothly and precisely than standard Classifier-Free Guidance. The approach also offers new insight into internal attention dynamics and a practical path to more controllable image editing in diffusion-based architectures.

Abstract

Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance, a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
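
The decomposition described in the abstract (a bias shared by a group of tokens, plus a per-token delta that is reweighted) can be illustrated with a small NumPy sketch. This is not the authors' released code: the function name `group_relative_rescale`, the parameter names `delta_scale` and `bias_scale`, the use of the token mean as the bias estimate, and the choice to apply it to one group of key tokens are all assumptions made for illustration.

```python
import numpy as np


def group_relative_rescale(tokens, delta_scale=1.0, bias_scale=1.0):
    """Reweight token deviations from their group mean.

    tokens: array of shape (num_tokens, dim), e.g. the key embeddings of one
    token group in one attention layer (hypothetical setup).
    The group mean stands in for the shared bias (an assumption); each
    token's delta is its offset from that mean.
    """
    bias = tokens.mean(axis=0, keepdims=True)   # shared bias, shape (1, dim)
    delta = tokens - bias                       # content-specific part
    return bias_scale * bias + delta_scale * delta


# Toy data: 16 tokens that share a large common offset plus small variations.
rng = np.random.default_rng(0)
shared_bias = rng.normal(size=(1, 8)) * 5.0
image_keys = shared_bias + rng.normal(size=(16, 8))

# Amplify the deltas by 20% while leaving the shared bias untouched.
out = group_relative_rescale(image_keys, delta_scale=1.2)
```

With `delta_scale=1.0` and `bias_scale=1.0` the tokens are returned unchanged, so in this sketch the scale acts as a continuous knob around the model's default behavior. How the scale maps to editing strength in a real MM-DiT editor is described in the paper itself, not established by this toy example.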

Paper Structure

This paper contains 26 sections, 12 equations, 30 figures, 2 tables, 1 algorithm.

Figures (30)

  • Figure 1: Variation of editing strength with respect to the relative attention guidance scale. Our approach enables continuous and fine-grained control of editing strength, striking a user-aligned balance between instruction following and consistency with the original image.
  • Figure 2: The visualization of the embedding features input to the attention layer, where a significant bias can be observed across different tokens.
  • Figure 3: Group Relative Attention Guidance.
  • Figure 4: (Kontext-Layer 2) Aggregating different tokens along the sequence dimension, we visualize the embedding features across the dimension and head axes. The visual features are concentrated at positions corresponding to high RoPE frequencies, while textual features are associated with low frequencies.
  • Figure 5: (Kontext-Layer 2) Mean vector magnitudes and standard deviations across different attention heads. A significant bias vector exists in the embedding space.
  • ...and 25 more figures