Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers

Guandong Li, Mengxia Ye

TL;DR

This paper reveals that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, and proposes Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key and Value channels.

Abstract

Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space, which governs feature aggregation, entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(\delta_k, \delta_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
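
The coarse-versus-fine distinction in the abstract can be made concrete with a short worked derivation. This is a sketch under an assumption not stated in this summary: that bias-delta rescaling takes the form $\tilde{x}_i = b + \delta\,(x_i - b)$, applied with the same bias vector to every key (or value) token that enters one softmax. The symbols $b_k$, $b_v$, $s_i$, $a_i$, and $o$ are introduced here for illustration only.

$$
\tilde{k}_i = b_k + \delta_k\,(k_i - b_k), \qquad \tilde{v}_i = b_v + \delta_v\,(v_i - b_v)
$$

For a query $q$ of dimension $d$, the rescaled logit splits into a term shared by all tokens and a scaled relative term:

$$
\frac{q^\top \tilde{k}_i}{\sqrt{d}} = \frac{q^\top b_k}{\sqrt{d}} + \delta_k\, s_i, \qquad s_i = \frac{q^\top (k_i - b_k)}{\sqrt{d}}
$$

The shared term cancels inside the softmax, so $\delta_k$ acts like an inverse temperature on the relative logits, which is a nonlinear effect on the attention weights:

$$
a_i(\delta_k) = \frac{\exp(\delta_k\, s_i)}{\sum_j \exp(\delta_k\, s_j)}
$$

Because the weights sum to one, the Value rescaling passes straight through the weighted sum, so the output is affine in $\delta_v$ and the weights are untouched:

$$
\tilde{o} = \sum_i a_i\, \tilde{v}_i = b_v + \delta_v\,(o - b_v), \qquad o = \sum_i a_i\, v_i
$$

If the rescaling is applied only to a subset of tokens (for example, one modality in the joint attention), the shared term no longer cancels exactly and the expressions above hold only approximately.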

Paper Structure

This paper contains 34 sections, 13 equations, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Overview of DCAG. The Key channel controls attention routing (coarse, nonlinear) and the Value channel controls feature aggregation (fine, linear). Both channels apply bias-delta rescaling independently before the joint attention computation.
  • Figure 2: V-channel sweep at fixed $\delta_k = 1.10$. LPIPS (blue, left axis) improves monotonically up to $\delta_v = 1.15$, then saturates. SSIM (red, right axis) follows the same pattern. The sweet spot at $\delta_v \approx 1.15$ reflects the linear nature of the Value channel.
  • Figure 3: Per-category LPIPS comparison between K-only ($\delta_k\!=\!1.10$) and DCAG ($\delta_k\!=\!1.10, \delta_v\!=\!1.15$). Top: absolute LPIPS values. Bottom: relative improvement (%). DCAG improves 8 out of 10 categories, with the largest gains in Delete Object ($\downarrow$4.3%) and Change Background ($\downarrow$4.2%).
  • Figure 4: Delta-to-bias ratio heatmaps across 60 layers (x-axis) and 24 denoising steps (y-axis). Left: Key space. Right: Value space. Both spaces exhibit pervasive bias-delta structure, with Value ratios (mean 2.45) consistently higher than Key ratios (mean 1.79). The weak correlation ($r = -0.17$) between the two confirms their structural independence.
  • Figure 5: Qualitative comparison across editing categories. DCAG ($\delta_k\!=\!1.10, \delta_v\!=\!1.15$) better preserves non-edited regions compared to K-only ($\delta_k\!=\!1.10$), while maintaining editing quality. K-only ($\delta_k\!=\!1.15$) achieves higher fidelity but at the cost of reduced editing strength.
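
The mechanism in the Figure 1 caption (independent bias-delta rescaling of Keys and Values before attention) can be illustrated with a minimal NumPy sketch. Several things here are assumptions rather than the paper's implementation: the bias is estimated as the mean token, the rescaling is applied to all tokens of a single-head attention, and the names `bias_delta_rescale` and `dual_channel_attention` are hypothetical. The data is random and serves only to show the behavior of the two knobs.

```python
import numpy as np

def bias_delta_rescale(x, delta):
    """Scale each token's deviation from the mean token by `delta`.

    Assumption: the layer-specific bias is approximated by the token mean.
    """
    bias = x.mean(axis=0, keepdims=True)      # shape (1, dim)
    return bias + delta * (x - bias)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dual_channel_attention(q, k, v, delta_k=1.0, delta_v=1.0):
    """Single-head attention with independent Key and Value rescaling."""
    k_g = bias_delta_rescale(k, delta_k)      # Key channel: changes routing
    v_g = bias_delta_rescale(v, delta_v)      # Value channel: changes aggregation
    weights = softmax(q @ k_g.T / np.sqrt(q.shape[-1]))
    return weights @ v_g, weights

# Toy tokens that cluster around a shared offset (illustrative data only).
rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
q = rng.normal(size=(n_tokens, dim))
k = 3.0 + 0.5 * rng.normal(size=(n_tokens, dim))
v = -2.0 + 0.5 * rng.normal(size=(n_tokens, dim))

out_base, w_base = dual_channel_attention(q, k, v)
out_v, w_v = dual_channel_attention(q, k, v, delta_v=1.15)  # Value-only
out_k, w_k = dual_channel_attention(q, k, v, delta_k=1.10)  # Key-only

print("max weight change, Value-only:", np.abs(w_v - w_base).max())
print("max weight change, Key-only:  ", np.abs(w_k - w_base).max())
```

In this sketch, changing `delta_v` leaves the attention weights unchanged and moves the output along a straight line away from the Value bias, while changing `delta_k` alters the weights themselves through the softmax. That matches the caption's description of a linear fine channel and a nonlinear coarse channel, under the assumptions stated above.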