Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

Kwanyoung Lee; SeungJu Cha; Yebin Ahn; Hyunwoo Oh; Sungho Koh; Dong-Jin Kim

Abstract

Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB), a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.
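
The abstract grounds the method in Tweedie's identity, which recovers the posterior mean $\mathbb{E}[x_0 \mid x_t]$ from the score of the noised marginal. The sketch below is only an illustration of that identity, not the paper's implementation: it assumes a one-dimensional Gaussian prior and the common variance-preserving parameterization $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, and all function names and numeric values are hypothetical. In this setting the score is available in closed form, so the Tweedie estimate can be checked against the exact Gaussian posterior mean.

```python
import math


def gaussian_marginal_score(x_t, mu, s2, a, sigma2):
    """Score of p_t(x_t) = N(a * mu, a^2 * s2 + sigma2) for a prior N(mu, s2)."""
    var_t = a * a * s2 + sigma2
    return -(x_t - a * mu) / var_t


def tweedie_posterior_mean(x_t, score, a, sigma2):
    """Tweedie's identity: E[x_0 | x_t] = (x_t + sigma2 * score) / a."""
    return (x_t + sigma2 * score) / a


def exact_posterior_mean(x_t, mu, s2, a, sigma2):
    """Posterior mean obtained directly from Gaussian conditioning."""
    var_t = a * a * s2 + sigma2
    return mu + a * s2 * (x_t - a * mu) / var_t


# Assumed illustrative values (not taken from the paper).
alpha_bar = 0.5                # cumulative signal level at timestep t
a = math.sqrt(alpha_bar)       # signal scale
sigma2 = 1.0 - alpha_bar       # noise variance
mu, s2 = 3.0, 1.5              # prior mean and variance
x_t = 0.7                      # a noisy observation

score = gaussian_marginal_score(x_t, mu, s2, a, sigma2)
tweedie_estimate = tweedie_posterior_mean(x_t, score, a, sigma2)
exact_estimate = exact_posterior_mean(x_t, mu, s2, a, sigma2)
```

In a trained diffusion model the analytic score would be replaced by the network's prediction, so the same identity maps any blended score to a posterior-mean estimate.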

Paper Structure (30 sections, 2 theorems, 41 equations, 17 figures, 12 tables, 1 algorithm)

Introduction
Related Works
Backgrounds
Diffusion Models
Tweedie’s Formula in Diffusion Models
Classifier-Free Guidance (CFG)
Method
Posterior Mean Alignment
Closed-Form Adaptive Coefficient
Experiments
Implementation Details
Main Results of AAPB
Ablation Studies
Conclusion
Theoretical Extension: Log-Concave Setting
...and 15 more sections

Key Result

Proposition 1

Consider two distributions: $q_{\gamma_t}$, which uses a fixed coefficient $\gamma_t$ at timestep $t$, and $q_{\mathrm{proj}}$, which adaptively projects $\gamma_t^*$ to minimize local score error. Suppose $p_T$ is $k$-strongly log-concave and satisfies the transport-information inequality [OTTO20003], where $k > 0$ is the strong log-concavity constant and $J(q\|p) := \mathbb{E}_{q}\!\left[\|\nabla \cdots\right]$ ...

Figures (17)

Figure 1: When the target concept lies in a low-density region, the generated samples tend to drift toward semantically dominant, high-density concepts [um2023don] in the learned score space, resulting in the suppression of rare or compositional attributes. Our proposed adaptive coefficient $\gamma_t^{*}$, derived from auxiliary prompt blending between $\tilde{c}_T$ and its anchor $\tilde{c}_A$, dynamically corrects this bias and produces target-faithful results. Unlike a fixed coefficient $\gamma_t$, the adaptive $\gamma_t^{*}$ adjusts per timestep to maintain a target-aligned denoising trajectory.
Figure 2: Toy example of target concept generation. (a) Training distributions: frequent samples $\mathcal{N}((0, -6), I)$ (orange) and rare-prior samples $\mathcal{N}((0, 3), 1.5I)$ (purple), with mean positions marked by crosses; (b) Generated samples using fixed linear interpolation $p_{\text{lerp}}(x|\tilde{c}_A, \tilde{c}_T; \gamma_t = 0.8)$; (c) Generated samples using adaptive interpolation $p(x| \tilde{c}_A,\tilde{c}_T; \gamma^*_t)$; (d) 2-Wasserstein distance between the generated distributions and the target $\mathcal{N}((0, 3), 1.5I)$ as a function of the interpolation parameter $\gamma_t$ (blue line). The distance curve shows that fixed interpolation reaches its minimum distance around $\gamma_t \approx 0.8$, while the adaptive method (red dashed line) consistently outperforms any fixed interpolation choice.
Figure 3: Qualitative comparison with state-of-the-art diffusion models on RareBench. All models are executed with the same random seed. Our method achieves stronger text-to-image alignment without additional training.
Figure 4: Qualitative comparison of image editing results using FlowEdit [kulikov2024flowedit] and our method. All edits are performed with the same random seed. Compared to FlowEdit, our approach better preserves source content while faithfully applying the instructed edits.
Figure 5: Comparison between fixed $\gamma_t$, R2F [park2024rare], and our adaptive coefficient on RareBench. The blue line denotes fixed $\gamma_t$, the green line denotes R2F, and the red line represents our adaptive approach, which consistently outperforms the fixed baseline.
...and 12 more figures
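
The toy setup in Figure 2 measures a 2-Wasserstein distance to the target $\mathcal{N}((0, 3), 1.5I)$. As a rough illustration of that metric, the sketch below evaluates the standard closed form for two isotropic Gaussians, $W_2^2 = \|m_a - m_b\|^2 + d\,(\sqrt{v_a} - \sqrt{v_b})^2$. It is an assumption that this closed form is relevant here: the source does not say how the distance in panel (d) was computed, generated samples need not be Gaussian or isotropic, and the function name is hypothetical.

```python
import math


def isotropic_gaussian_w2(mean_a, var_a, mean_b, var_b):
    """2-Wasserstein distance between N(mean_a, var_a * I) and N(mean_b, var_b * I).

    Assumes isotropic covariances; var_a and var_b are the scalar variances.
    """
    if len(mean_a) != len(mean_b):
        raise ValueError("means must have the same dimension")
    dim = len(mean_a)
    # Squared Euclidean distance between the means.
    mean_term = sum((p - q) ** 2 for p, q in zip(mean_a, mean_b))
    # Covariance term for commuting (here isotropic) covariances.
    cov_term = dim * (math.sqrt(var_a) - math.sqrt(var_b)) ** 2
    return math.sqrt(mean_term + cov_term)


# Distributions from the Figure 2 caption: (mean, scalar variance).
frequent = ((0.0, -6.0), 1.0)
target = ((0.0, 3.0), 1.5)

w2_frequent_to_target = isotropic_gaussian_w2(*frequent, *target)
w2_target_to_target = isotropic_gaussian_w2(*target, *target)
```

The mean term dominates here because the two modes are 9 units apart, which is why samples that drift toward the frequent mode sit far from the target under this metric.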

Theorems & Definitions (4)

Proposition 1: Extension to the log-concave case
proof
Proposition 2: Optimal Adaptive Coefficient
proof
