FreeBlend: Advancing Concept Blending with Staged Feedback-Driven Interpolation Diffusion
Yufan Zhou, Haoyu Shen, Huan Wang
TL;DR
FreeBlend is a training-free method for concept blending in diffusion models. It combines unCLIP-conditioned image inputs, a three-stage denoising pipeline, and a feedback-driven latent interpolation mechanism. The method has two core components: (i) transferred unCLIP image conditions that guide diffusion without text prompts, and (ii) a stepwise interpolation with a feedback loop that gradually increases the blending latent’s influence while updating the auxiliary latents to preserve both concepts. Key contributions are the three-stage generation framework, a mathematically defined interpolation and feedback scheme (e.g., $p = 1 - \frac{t}{T}$, $L'^{(t)}_{b} = p L^{(t)}_{b} + \lambda \sum_{n=1}^{N} \gamma_n L_a^{(t,n)}$, and the corresponding updates to the auxiliary latents $L'^{(t,k)}_{a}$), and extensive qualitative and quantitative validation showing state-of-the-art blending performance on CTIB/CTIR benchmarks. The results indicate that image-conditioned diffusion with feedback interpolation can produce visually coherent and semantically rich blends, which is useful for creative content generation, while the workflow remains training-free and flexible. The work also addresses cross-modal gaps by leveraging unCLIP embeddings and proposes future directions in multi-concept blending and bias mitigation.
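The blending-latent update quoted in the summary can be illustrated with a small sketch. This is not the authors' implementation: it only evaluates $p = 1 - \frac{t}{T}$ and $L'^{(t)}_{b} = p L^{(t)}_{b} + \lambda \sum_{n=1}^{N} \gamma_n L_a^{(t,n)}$ as written, on latents flattened to plain lists of floats. The function names `blend_ratio` and `blend_latent`, the parameter names, and the example values of $\lambda$ and $\gamma_n$ are assumptions made for illustration. The feedback update of the auxiliary latents is not spelled out in this section, so it is left out.

```python
from typing import List, Sequence


def blend_ratio(t: int, T: int) -> float:
    """Blending ratio p = 1 - t/T (hypothetical helper).

    With t counting down from T to 0 during denoising, p grows from 0 to 1,
    so the blending latent gains influence step by step.
    """
    if T <= 0 or not 0 <= t <= T:
        raise ValueError("expected T > 0 and 0 <= t <= T")
    return 1.0 - t / T


def blend_latent(
    L_b: Sequence[float],
    L_aux: Sequence[Sequence[float]],
    t: int,
    T: int,
    lam: float,
    gammas: Sequence[float],
) -> List[float]:
    """Evaluate L'_b = p * L_b + lam * sum_n gamma_n * L_a^(n) elementwise.

    L_b    : blending latent at step t, flattened to a list (assumption).
    L_aux  : the N auxiliary latents at step t, each the same length as L_b.
    lam    : the weight lambda from the formula; its value is not given here.
    gammas : per-concept weights gamma_n; their values are not given here.
    """
    if len(L_aux) != len(gammas):
        raise ValueError("need one gamma per auxiliary latent")
    p = blend_ratio(t, T)
    blended = []
    for i, b in enumerate(L_b):
        # Weighted sum of the auxiliary latents at position i.
        aux = sum(g * L_a[i] for g, L_a in zip(gammas, L_aux))
        blended.append(p * b + lam * aux)
    return blended


if __name__ == "__main__":
    # Toy two-element latents with two auxiliary concepts, halfway through.
    out = blend_latent(
        L_b=[1.0, 2.0],
        L_aux=[[2.0, 0.0], [0.0, 4.0]],
        t=500,
        T=1000,
        lam=0.5,
        gammas=[0.5, 0.5],
    )
    print(out)
```

In a real pipeline the latents would be tensors and this update would run once per denoising step inside the sampler loop; the list-based form here only makes the arithmetic easy to follow.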
Abstract
Concept blending is a promising yet underexplored area in generative models. Recent approaches, such as embedding mixing and latent modification based on structural sketches, often suffer from incompatible semantic information and discrepancies in shape and appearance. In this work, we introduce FreeBlend, an effective, training-free framework designed to address these challenges. To mitigate cross-modal loss and enhance feature detail, we leverage transferred image embeddings as conditional inputs. The framework employs a stepwise increasing interpolation strategy between latents, progressively adjusting the blending ratio to seamlessly integrate auxiliary features. Additionally, we introduce a feedback-driven mechanism that updates the auxiliary latents in reverse order, facilitating global blending and preventing rigid or unnatural outputs. Extensive experiments demonstrate that our method significantly improves both the semantic coherence and the visual quality of blended images.
