Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation

Binglei Li; Mengping Yang; Zhiyu Tan; Junping Zhang; Hao Li

TL;DR

Diff-Aid is a lightweight inference-time method that adaptively adjusts per-token interactions between text and image features across transformer blocks and denoising timesteps. It also yields interpretable modulation patterns that reveal how different blocks, timesteps, and textual tokens contribute to semantic alignment during denoising.

Abstract

Recent text-to-image (T2I) diffusion models have advanced remarkably, yet faithfully following complex textual descriptions remains challenging because textual and visual features interact insufficiently. Prior approaches enhance these interactions through architectural design or handcrafted weighting of textual conditions, but they lack flexibility and overlook how the interactions change across blocks and denoising stages. To provide a more flexible and efficient solution, we propose Diff-Aid, a lightweight inference-time method that adaptively adjusts per-token text and image interactions across transformer blocks and denoising timesteps. Beyond improving generation quality, Diff-Aid yields interpretable modulation patterns that reveal how different blocks, timesteps, and textual tokens contribute to semantic alignment during denoising. As a plug-and-play module, Diff-Aid can be seamlessly integrated into downstream applications for further improvement, including style LoRAs, controllable generation, and zero-shot editing. Experiments on strong baselines (SD 3.5 and FLUX) demonstrate consistent improvements in prompt adherence, visual quality, and human preference across various metrics. Our code and models will be released.
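To make the idea of per-token, per-block, per-timestep modulation concrete, the following is a minimal sketch and not the released implementation. It assumes a hypothetical weight of the form alpha_i = 1 + tanh(w · c_i + b_block + b_time) for each text token c_i; the function names `aid_weights` and `modulate`, the parameters `w`, `block_bias`, and `time_bias`, and the tanh parameterization are all illustrative assumptions that do not come from the paper. The one property the sketch is meant to show is that a weight of 1 leaves the baseline text conditioning unchanged, while larger or smaller weights strengthen or weaken a token's contribution.

```python
import math

# Hypothetical sketch: all names and the tanh form are assumptions,
# not the parameterization used by Diff-Aid.

def aid_weights(text_tokens, w, block_bias, time_bias):
    """Return one weight per text token, each in the open interval (0, 2).

    text_tokens: list of token feature vectors (lists of floats).
    w:           assumed projection vector shared by all tokens.
    block_bias:  assumed scalar offset for the current transformer block.
    time_bias:   assumed scalar offset for the current denoising timestep.
    """
    alphas = []
    for token in text_tokens:
        score = sum(wj * cj for wj, cj in zip(w, token)) + block_bias + time_bias
        # A score of 0 gives alpha = 1, i.e. the unmodified baseline.
        alphas.append(1.0 + math.tanh(score))
    return alphas


def modulate(text_tokens, alphas):
    """Scale each text token by its weight before it enters attention."""
    return [[a * c for c in token] for token, a in zip(text_tokens, alphas)]


text_tokens = [[0.5, -1.0], [2.0, 0.0], [0.0, 0.0]]

# Zero parameters reproduce the baseline conditioning exactly.
identity = aid_weights(text_tokens, w=[0.0, 0.0], block_bias=0.0, time_bias=0.0)

# A nonzero projection gives different tokens different weights.
boosted = aid_weights(text_tokens, w=[1.0, 0.0], block_bias=0.0, time_bias=0.0)
modulated = modulate(text_tokens, boosted)
```

In this toy setup the block and timestep enter only as scalar offsets; a learned module would presumably condition on richer block and timestep information, which the sketch does not attempt to model.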

Paper Structure (32 sections, 8 equations, 14 figures, 8 tables)

Introduction
Methodology
Systematic Analysis of Design Motivations
Sufficient interactions lead to improved prompt adherence.
The Proposed Diff-Aid
Training and Inference
Experiments
Experiment Settings
Empirical Observation of Learned $\alpha$
Main Results
Ablation Studies
Generalized Downstream Applications
Related Works
Conclusion
Impact Statement
...and 17 more sections

Figures (14)

Figure 1: We present Diff-Aid, which adaptively improves the interaction between textual conditions and image latents at test time. Thanks to its plug-in design, Diff-Aid improves prompt following and synthesis quality for (a) text-to-image generation, (b) controllable generation with added conditional inputs, (c) integration of off-the-shelf LoRAs, and (d) zero-shot instructional image editing.
Figure 2: Attention visualization of different blocks across denoising steps. If the interaction between textual conditions and image latents is insufficient, the rendered image may fail to faithfully reflect the given conditions.
Figure 3: Overview of our proposed Aid blocks. (a) The original diffusion transformers map a random latent and input textual conditions to output features. (b) Our Aid blocks adaptively modulate textual conditions with respect to denoising timesteps across different transformer blocks, enhancing the interaction between image latents and textual conditions.
Figure 4: Distributions of the learned adaptive weights $\alpha$ across (a) diffusion transformer blocks, (b) textual tokens, and (c) denoising timesteps. (d) Visualization of the attention norm across timesteps. Diff-Aid learns to adaptively adjust block-wise interactions with token-level textual conditions along the denoising trajectory, while also capturing the intrinsic attributes of the baseline models to enable tighter and more faithful conditioning during generation.
Figure 5: Qualitative comparison of images generated by the FLUX and SD 3.5 baselines with and without Diff-Aid. The numbers below each image are its HPSv3, Image Reward, and Aesthetic scores. Our method demonstrates superior prompt adherence and image quality. More qualitative results are given in the appendix.
...and 9 more figures
