Plug-and-Play Multi-Concept Adaptive Blending for High-Fidelity Text-to-Image Synthesis

Young-Beom Woo

TL;DR

This paper addresses the challenge of integrating multiple personalized concepts into a single text-to-image scene without additional tuning. It introduces PnP-MIX, a two-stage, tuning-free framework that uses background-aware masking and latent inversion, augmented by guided appearance attention, mask-guided noise mixing, and background dilution++ to prevent concept leakage and background distortion. Extensive experiments and ablations show substantial improvements over tuning-based and tuning-free baselines in both single- and multi-concept scenarios, with strong quantitative and user-evaluated evidence. The work offers a practical, scalable approach for high-fidelity, multi-concept T2I synthesis with broad applicability in real-world content creation.

Abstract

Integrating multiple personalized concepts into a single image has recently become a significant area of focus within Text-to-Image (T2I) generation. However, existing methods often underperform on complex multi-object scenes due to unintended alterations in both personalized and non-personalized regions. This not only fails to preserve the intended prompt structure but also disrupts interactions among regions, leading to semantic inconsistencies. To address this limitation, we introduce plug-and-play multi-concept adaptive blending for high-fidelity text-to-image synthesis (PnP-MIX), an innovative, tuning-free approach designed to seamlessly embed multiple personalized concepts into a single generated image. Our method leverages guided appearance attention to faithfully reflect the intended appearance of each personalized concept. To further enhance compositional fidelity, we present a mask-guided noise mixing strategy that preserves the integrity of non-personalized regions such as the background or unrelated objects while enabling the precise integration of personalized objects. Finally, to mitigate concept leakage, i.e., the inadvertent leakage of personalized concept features into other regions, we propose background dilution++, a novel strategy that effectively reduces such leakage and promotes accurate localization of features within personalized regions. Extensive experimental results demonstrate that PnP-MIX consistently surpasses existing methodologies in both single- and multi-concept personalization scenarios, underscoring its robustness and superior performance without additional model tuning.
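
The mask-guided noise mixing idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact mixing rule, so the rule below is an assumption: inside each concept mask the concept's latent is used, and everywhere else the background latent is left untouched. The function name `mix_latents` and the toy 2x2 grids are hypothetical and stand in for real diffusion latents.

```python
from typing import List

Grid = List[List[float]]


def mix_latents(z_back: Grid, z_refs: List[Grid], masks: List[Grid]) -> Grid:
    """Blend per-concept latents into a background latent using binary masks.

    Assumed rule (not taken from the paper): where mask k is 1, use z_refs[k];
    where it is 0, keep whatever value is already there.
    """
    if len(z_refs) != len(masks):
        raise ValueError("need exactly one mask per reference latent")
    out = [row[:] for row in z_back]  # copy so the background latent is not mutated
    for z_ref, mask in zip(z_refs, masks):
        for i, mask_row in enumerate(mask):
            for j, m in enumerate(mask_row):
                # m == 1 selects the concept latent, m == 0 keeps the current value
                out[i][j] = m * z_ref[i][j] + (1.0 - m) * out[i][j]
    return out


# Toy example: two concepts placed in opposite corners of a 2x2 latent.
z_back = [[0.0, 0.0], [0.0, 0.0]]
z_ref_1 = [[1.0, 1.0], [1.0, 1.0]]
z_ref_2 = [[2.0, 2.0], [2.0, 2.0]]
m_1 = [[1.0, 0.0], [0.0, 0.0]]
m_2 = [[0.0, 0.0], [0.0, 1.0]]

mixed = mix_latents(z_back, [z_ref_1, z_ref_2], [m_1, m_2])
print(mixed)
```

In this sketch, positions covered by no mask keep the background value exactly, which mirrors the stated goal of preserving non-personalized regions.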

Paper Structure

This paper contains 19 sections, 8 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Comparison of multi-subject personalization methods. Leftmost images show the personalized references for the cat and the woman. Remaining images depict the outputs of Custom Diffusion, Perfusion, Mix-of-Show, and our method (PnP-MIX), all conditioned on the prompt: "A photo of a cat and a woman hugging each other, lighthouse background." While existing approaches often cause subjects to disappear, appearances to blend, or structural details to distort, our method preserves each concept’s features and the overall scene context, delivering coherent and high-quality results.
  • Figure 2: Overview of PnP-MIX. Our framework integrates multiple personalized concepts into a single image via a two-stage pipeline. In the first stage, background and object masks are extracted using Grounded SAM ren2024grounded, and Inpaint Anything yu2023inpaint is used to generate a background image with the original object regions removed. In the second stage, latent representations of the inpainted background, the original background, and each personalized concept, along with their masks, are obtained via DDPM Inversion huberman2024edit and fused using guided appearance attention, mask-guided noise mixing, and background dilution++. Unlike tuning-based personalization methods that rely on random background generation, PnP-MIX supports explicit background selection, providing fine-grained control over scene composition.
  • Figure 3: Input preparation pipeline. Given a user-selected background image $I_{\mathrm{back}}$ containing the target concept, we apply Inpaint Anything yu2023inpaint to remove the specified object region and obtain the inpainted background $I_{\mathrm{inpaint}}$. Then, Grounded SAM ren2024grounded generates binary masks $M_{\mathrm{back}}$, $M_{1}$, and $M_{2}$ to segment the background and each personalized concept. These components constitute the inputs fed into the PnP-MIX framework.
  • Figure 4: Edit-Friendly DDPM Inversion and latent cloning. We apply Edit-Friendly DDPM Inversion huberman2024edit to the inpainted background $I_{\mathrm{inpaint}}$, the selected background $I_{\mathrm{back}}$, and the personalized concept images $I_{1}$, $I_{2}$, to obtain their noise codes $z_{T}^{\mathrm{inpaint}}$, $z_{T}^{\mathrm{back}}$, $z_{T}^{\mathrm{per}_1}$, and $z_{T}^{\mathrm{per}_2}$. The background latent $z_{T}^{\mathrm{back}}$ (highlighted in red) is then cloned to produce the output latent $z_{T}^{\mathrm{out}}$ and reference latents $z_{T}^{\mathrm{ref}_1}$, $z_{T}^{\mathrm{ref}_2}$. These duplicated latents guide the subsequent multi-concept fusion stage, ensuring faithful background reconstruction and accurate embedding of each personalized concept.
  • Figure 5: Guided appearance attention. At each self-attention layer, the personalized concept latent $z_t^{\mathrm{per}}$ and the corresponding reference latent $z_t^{\mathrm{ref}}$ are processed together to inject appearance information from the personalized concept into the reference. The key from $z_t^{\mathrm{ref}}$ is replaced with that from $z_t^{\mathrm{per}}$, while the value is adjusted using value guidance. This mechanism enriches the reference latent with vivid color and texture cues from the personalized concept while preserving its spatial structure.
  • ...and 8 more figures
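
The key replacement described in the Figure 5 caption can be sketched as a single attention step. This is a simplified, pure-Python illustration rather than the authors' implementation: queries come from the reference latent, keys come from the personalized concept latent, and the values are combined by an assumed guidance rule `v_ref + guidance * (v_per - v_ref)`. The caption does not specify the form of value guidance, so that rule, the `guidance` parameter, and all function names are hypothetical. The sketch also assumes both latents have the same number of tokens.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two row-major matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def guided_appearance_attention(
    q_ref: Matrix, k_per: Matrix, v_ref: Matrix, v_per: Matrix, guidance: float = 1.0
) -> Matrix:
    """Attention with reference queries and personalized-concept keys.

    The value mix is an assumed stand-in for the paper's value guidance:
    guidance = 0 keeps the reference values, guidance = 1 uses the
    personalized values.
    """
    dim = len(q_ref[0])
    scores = matmul(q_ref, transpose(k_per))  # key swap: K comes from z_per
    scores = [[s / math.sqrt(dim) for s in row] for row in scores]
    weights = softmax_rows(scores)
    v_guided = [
        [vr + guidance * (vp - vr) for vr, vp in zip(row_ref, row_per)]
        for row_ref, row_per in zip(v_ref, v_per)
    ]
    return matmul(weights, v_guided)


# Toy example with two tokens of dimension two.
q_ref = [[1.0, 0.0], [0.0, 1.0]]
k_per = [[1.0, 0.0], [0.0, 1.0]]
v_ref = [[0.0, 0.0], [0.0, 0.0]]
v_per = [[1.0, 0.0], [0.0, 1.0]]

attended = guided_appearance_attention(q_ref, k_per, v_ref, v_per, guidance=1.0)
unguided = guided_appearance_attention(q_ref, k_per, v_ref, v_per, guidance=0.0)
print(attended)
```

Because the queries still come from the reference latent, each output token stays tied to a reference position, while the content it gathers comes from the personalized concept. This is one way to read the caption's claim that appearance is injected while spatial structure is preserved.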