Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation

Binglei Li, Mengping Yang, Zhiyu Tan, Junping Zhang, Hao Li

TL;DR

Diff-Aid is a lightweight inference-time method that adaptively adjusts per-token text and image interactions across transformer blocks and denoising timesteps. It also yields interpretable modulation patterns that reveal how different blocks, timesteps, and textual tokens contribute to semantic alignment during denoising.

Abstract

Recent text-to-image (T2I) diffusion models have achieved remarkable advancement, yet faithfully following complex textual descriptions remains challenging due to insufficient interactions between textual and visual features. Prior approaches enhance such interactions via architectural design or handcrafted textual condition weighting, but lack flexibility and overlook the dynamic interactions across different blocks and denoising stages. To provide a more flexible and efficient solution to this problem, we propose Diff-Aid, a lightweight inference-time method that adaptively adjusts per-token text and image interactions across transformer blocks and denoising timesteps. Beyond improving generation quality, Diff-Aid yields interpretable modulation patterns that reveal how different blocks, timesteps, and textual tokens contribute to semantic alignment during denoising. As a plug-and-play module, Diff-Aid can be seamlessly integrated into downstream applications for further improvement, including style LoRAs, controllable generation, and zero-shot editing. Experiments on strong baselines (SD 3.5 and FLUX) demonstrate consistent improvements in prompt adherence, visual quality, and human preference across various metrics. Our code and models will be released.
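
To make "per-token adjustment across blocks and timesteps" concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the architecture, so the gate `token_weights`, its parameters (`w_tok`, `block_emb`, `w_time`), the `1 + tanh` form, and the projection-free attention are all assumptions chosen for illustration. The sketch only shows the general idea of computing one weight per text token, conditioned on the block index and timestep, and scaling the text tokens before they attend jointly with the image tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_weights(text_tokens, block_idx, t, w_tok, block_emb, w_time):
    # Hypothetical gate (an assumption, not from the paper): one weight per
    # text token, depending on the token, the block index, and the timestep.
    logits = text_tokens @ w_tok + block_emb[block_idx] + w_time * t
    # 1 + tanh keeps each weight in (0, 2) and equals 1 when the logit is 0,
    # so zero-initialised parameters leave the base model unchanged.
    return 1.0 + np.tanh(logits)

def modulated_joint_attention(img_tokens, text_tokens, alpha):
    # Scale each text token by its weight, then run a simplified joint
    # self-attention over image and text tokens (no Q/K/V projections).
    scaled_text = alpha[:, None] * text_tokens
    tokens = np.concatenate([img_tokens, scaled_text], axis=0)
    d = tokens.shape[-1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)
    return attn @ tokens

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))   # 4 image tokens, feature dim 8
txt = rng.normal(size=(3, 8))   # 3 text tokens, feature dim 8

# Zero parameters: every weight is 1, i.e. no modulation.
alpha_identity = token_weights(txt, 1, 0.5, np.zeros(8), np.zeros(2), 0.0)

# Small non-zero parameters: weights vary per token, block, and timestep.
w_tok = 0.1 * rng.normal(size=8)
block_emb = np.array([0.2, -0.2])
alpha = token_weights(txt, 1, 0.5, w_tok, block_emb, 0.3)

out = modulated_joint_attention(img, txt, alpha)
```

A gate that reduces to the identity at zero parameters is one plausible way to keep such a module plug-and-play on top of a frozen base model; whether Diff-Aid uses this particular form is not stated in the abstract.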

Paper Structure (32 sections, 8 equations, 14 figures, 8 tables)

Figures (14)

  • Figure 1: We present Diff-Aid to adaptively improve the interaction between textual conditions and image latents at test time. Thanks to its plug-in design, Diff-Aid improves prompt following and synthesis quality for (a) text-to-image generation, (b) controllable generation with added conditional inputs, (c) integration of off-the-shelf LoRAs, and (d) zero-shot instructional image editing.
  • Figure 2: Attention visualization of different blocks across denoising steps. If the interaction between textual conditions and image latents is insufficient, the rendered image may fail to faithfully reflect given conditions.
  • Figure 3: Overview of our proposed Aid blocks. (a) The original diffusion transformers map a random latent and input textual conditions to output features. (b) Our Aid blocks adaptively modulate textual conditions with respect to denoising timesteps across different transformer blocks, enhancing the interaction between image latents and textual conditions.
  • Figure 4: Distributional visualization of the learned adaptive weights $\alpha$ across various (a) diffusion transformer blocks, (b) textual tokens, and (c) denoising timesteps. (d) Visualization of the attention norm across various timesteps. Our Diff-Aid learns to adaptively adjust block-wise interactions with token-level textual conditions across denoising trajectories, while also capturing the intrinsic attributes of the baseline models to enable tighter and more faithful conditioning during generation.
  • Figure 5: Qualitative comparisons of images generated by baseline FLUX and SD 3.5 with and without our proposed Diff-Aid. The quantitative results below each image show the HPSv3, Image Reward, and Aesthetic scores. Our method demonstrates superior prompt adherence and image quality. More qualitative results are given in the appendix.
  • ...and 9 more figures