FreeBlend: Advancing Concept Blending with Staged Feedback-Driven Interpolation Diffusion

Yufan Zhou, Haoyu Shen, Huan Wang

TL;DR

FreeBlend is a training-free method for concept blending in diffusion models. It combines unCLIP-conditioned image inputs, a three-stage denoising pipeline, and a feedback-driven latent interpolation mechanism. The method has two core components: (i) transferred unCLIP image conditions that guide diffusion without text prompts, and (ii) a stepwise interpolation with a feedback loop that gradually increases the blending latent’s influence while updating the auxiliary latents so that both concepts are preserved. Key contributions are the three-stage generation framework; an explicitly defined interpolation and feedback scheme (e.g., $p = 1 - \frac{t}{T}$, $L'^{(t)}_{b} = p L^{(t)}_{b} + \lambda \sum_{n=1}^{N} \gamma_n L_a^{(t,n)}$, and the corresponding updates to $L'^{(t,k)}_{a}$); and qualitative and quantitative validation reporting state-of-the-art blending performance on CTIB/CTIR benchmarks. The results indicate that image-conditioned diffusion with feedback interpolation can produce visually coherent and semantically rich blends for creative content generation while keeping the workflow training-free and flexible. The work also addresses cross-modal gaps by leveraging unCLIP embeddings and proposes multi-concept blending and bias mitigation as future directions.
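As a rough illustration of the two formulas quoted above, the following minimal sketch computes the blend ratio $p = 1 - \frac{t}{T}$ and one interpolation step $L'^{(t)}_{b} = p L^{(t)}_{b} + \lambda \sum_{n=1}^{N} \gamma_n L_a^{(t,n)}$. It is not the authors' implementation: latents are represented as flat Python lists rather than tensors, the function names are hypothetical, and $\lambda$ is passed in as a free parameter because its definition is not given here. The feedback update of the auxiliary latents is omitted for the same reason.

```python
def blend_ratio(t, T):
    """Return p = 1 - t/T, which is 0 at the noisiest step t = T and 1 at t = 0."""
    if T <= 0 or not 0 <= t <= T:
        raise ValueError("expected T > 0 and 0 <= t <= T")
    return 1.0 - t / T


def interpolate_blend_latent(L_b, L_aux, gammas, t, T, lam):
    """One interpolation step: L'_b = p * L_b + lam * sum_n gamma_n * L_a^(n).

    L_b    : blending latent as a flat list of floats (assumed representation)
    L_aux  : list of N auxiliary latents, each the same length as L_b
    gammas : list of N per-concept weights gamma_n
    lam    : scalar weight lambda (assumption: supplied by the caller)
    """
    if len(L_aux) != len(gammas):
        raise ValueError("need exactly one gamma per auxiliary latent")
    p = blend_ratio(t, T)
    blended = []
    for i, b in enumerate(L_b):
        # Weighted sum of the auxiliary latents at position i.
        aux_mix = sum(g * L_a[i] for g, L_a in zip(gammas, L_aux))
        blended.append(p * b + lam * aux_mix)
    return blended
```

Because denoising runs from $t = T$ down to $t = 0$, $p$ grows from 0 to 1 over the course of generation, which matches the description of the blending latent's influence increasing step by step. Changing the entries of `gammas` shifts the relative contribution of each concept.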

Abstract

Concept blending is a promising yet underexplored area in generative models. While recent approaches, such as embedding mixing and latent modification based on structural sketches, have been proposed, they often suffer from incompatible semantic information and discrepancies in shape and appearance. In this work, we introduce FreeBlend, an effective, training-free framework designed to address these challenges. To mitigate cross-modal loss and enhance feature detail, we leverage transferred image embeddings as conditional inputs. The framework employs a stepwise increasing interpolation strategy between latents, progressively adjusting the blending ratio to seamlessly integrate auxiliary features. Additionally, we introduce a feedback-driven mechanism that updates the auxiliary latents in reverse order, facilitating global blending and preventing rigid or unnatural outputs. Extensive experiments demonstrate that our method significantly improves both the semantic coherence and visual quality of blended images, yielding compelling and coherent results.

Paper Structure

This paper contains 30 sections, 9 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: We introduce FreeBlend, a training-free approach that effectively blends two concepts to generate new objects through feedback interpolation and auxiliary inference. This method consistently produces visually coherent and harmonious blends, enabling users to create customized images with diverse combinations of concepts.
  • Figure 2: Overview of our method. Two input images, generated by Stable Diffusion with respective concepts, are encoded into CLIP embeddings and mapped to a shared text space via the Linear Prior Converter from unCLIP (Ramesh et al., 2022). These embeddings condition the U-Net, one for downsampling and the other for upsampling. The blending latent $L_b$ is initialized with Gaussian noise and processed during initialization. The module within the dashed box is used only in the blending stage. Noise $\epsilon$ is added to the image embeddings to generate initial auxiliary latents, which are interpolated into $L^{(t)}_b$ for feedback. The latent $L^{(t)}_a$ is combined with $L'^{(t)}_b$ by proportion $p$. Updated latents $L'^{(t)}_a$ are refined in auxiliary inference using unCLIP embeddings to preserve original features, while $L'^{(t)}_b$ is denoised in the blending inference. Finally, the blending latent is refined and passed to the VAE decoder to generate the final image.
  • Figure 3: At the top are the concepts, and on the left are the methods we compare. Each row shows the results, with three images per method and concept pair, evaluating our method against five blending methods. Unlike others, which can suffer from rigid splicing, discordant compositions, or concept bias, our method smoothly integrates features from different concepts into a cohesive new object.
  • Figure 4: Ablation study on the impact of $\gamma$. The results show that, in the first row, better blending is achieved on the left side, while the right side appears more spliced. In the second row, both sides exhibit a more visually appealing effect. The blending process, however, is inherently subjective, and users can adjust the parameter $\gamma$ to tailor the output according to their preferences. By adjusting $\gamma$, users can control the contribution of each concept, thereby mitigating associated biases.
  • Figure 5: Ablation study of the feedback mechanism: removing the feedback module causes image overlap, disrupting blending and preventing integration.
  • ...and 10 more figures