Plug-and-Play Multi-Concept Adaptive Blending for High-Fidelity Text-to-Image Synthesis
Young-Beom Woo
TL;DR
This paper addresses the challenge of integrating multiple personalized concepts into a single text-to-image scene without additional tuning. It introduces PnP-MIX, a two-stage, tuning-free framework that combines background-aware masking and latent inversion with three components: guided appearance attention, mask-guided noise mixing, and background dilution++. Together these are intended to prevent concept leakage and background distortion. Experiments and ablations show improvements over tuning-based and tuning-free baselines in both single- and multi-concept scenarios, supported by quantitative metrics and user evaluations. The work offers a practical approach to high-fidelity, multi-concept T2I synthesis for real-world content creation.
Abstract
Integrating multiple personalized concepts into a single image has recently become a significant area of focus within Text-to-Image (T2I) generation. However, existing methods often underperform on complex multi-object scenes because they unintentionally alter both personalized and non-personalized regions. These alterations fail to preserve the intended prompt structure and disrupt interactions among regions, leading to semantic inconsistencies. To address this limitation, we introduce plug-and-play multi-concept adaptive blending for high-fidelity text-to-image synthesis (PnP-MIX), a tuning-free approach designed to embed multiple personalized concepts into a single generated image. Our method leverages guided appearance attention to faithfully reflect the intended appearance of each personalized concept. To further enhance compositional fidelity, we present a mask-guided noise mixing strategy that preserves the integrity of non-personalized regions, such as the background or unrelated objects, while enabling the precise integration of personalized objects. Finally, to mitigate concept leakage, i.e., the unintended spread of personalized concept features into other regions, we propose background dilution++, a novel strategy that reduces such leakage and promotes accurate localization of features within personalized regions. Extensive experimental results demonstrate that PnP-MIX consistently surpasses existing methods in both single- and multi-concept personalization scenarios, underscoring its robustness without additional model tuning.
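As a rough illustration of the idea behind mask-guided noise mixing, the sketch below blends two noise predictions with a spatial mask so that the personalized region follows one prediction and everything else follows the other. The abstract does not give the actual formulation, so this is only a generic mask blend under stated assumptions: the function name `mask_guided_mix`, the array shapes, and the convex-combination rule are all hypothetical and are not taken from the paper.

```python
import numpy as np

def mask_guided_mix(eps_concept, eps_base, mask):
    """Blend two noise predictions with a spatial mask in [0, 1].

    Hypothetical helper (not the paper's formulation):
    mask == 1 keeps the concept-branch prediction,
    mask == 0 keeps the base-branch prediction.
    """
    mask = np.clip(mask, 0.0, 1.0)
    return mask * eps_concept + (1.0 - mask) * eps_base

# Toy latents: 4 channels on an 8x8 grid (assumed shapes).
rng = np.random.default_rng(0)
eps_base = rng.standard_normal((4, 8, 8))
eps_concept = rng.standard_normal((4, 8, 8))

# Binary mask marking a personalized region; broadcast over channels.
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0

mixed = mask_guided_mix(eps_concept, eps_base, mask)
```

Outside the masked square, `mixed` equals `eps_base`, which is the sense in which a blend of this kind leaves the background and unrelated objects untouched; inside the square it equals `eps_concept`.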
