LaTexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending
Jian Jin, Zhenbo Yu, Yang Shen, Zhenyong Fu, Jian Yang
TL;DR
LaTexBlend addresses the challenge of scaling customized text-to-image (T2I) generation to multiple concepts without sacrificing quality or efficiency. It introduces a latent textual space and a concept bank that stores compact concept representations, so that many concepts can be blended on the fly at inference time with no additional tuning. Empirical results show better concept fidelity and layout coherence than strong baselines, with tuning cost that grows linearly in the number of concepts and no extra inference burden as the concept count grows. The approach makes personalized T2I generation more practical and supports layout-conditioned and complex scene generation, with potential use in broader creative tooling.
Abstract
Customized text-to-image generation renders user-specified concepts into novel contexts based on textual prompts. Scaling the number of concepts in customized generation would meet a broader range of user needs, but existing methods struggle with both generation quality and computational efficiency. In this paper, we propose LaTexBlend, a novel framework for effectively and efficiently scaling multi-concept customized generation. The core idea of LaTexBlend is to represent single concepts and blend multiple concepts within a Latent Textual space, which is positioned after the text encoder and a linear projection. LaTexBlend customizes each concept individually and stores it in a concept bank as a compact representation of latent textual features that captures sufficient concept information to ensure high fidelity. At inference, concepts from the bank can be freely and seamlessly combined in the latent textual space, offering two key merits for multi-concept generation: 1) excellent scalability, and 2) significant reduction of denoising deviation, preserving coherent layouts. Extensive experiments demonstrate that LaTexBlend can flexibly integrate multiple customized concepts with harmonious structures and high subject fidelity, substantially outperforming baselines in both generation quality and computational efficiency. Our code will be publicly available.
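
The concept-bank idea described above can be pictured with a toy sketch. The abstract does not specify how blending is carried out, so everything here is an assumption made for illustration rather than the authors' implementation: the names `ConceptBank`, `linear_project`, and `blend` are hypothetical, features are plain lists of floats, and blending is modeled as overwriting the rows of the projected prompt features at the positions a concept occupies.

```python
# Toy sketch only. Names, shapes, and the overwrite-by-position rule are
# assumptions for illustration, not the method from the paper.
from typing import Dict, List

Vector = List[float]
Features = List[Vector]  # one row per prompt token


def linear_project(tokens: Features, weight: List[Vector]) -> Features:
    """Apply a linear projection; each row of `weight` yields one output dim."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight]
            for tok in tokens]


class ConceptBank:
    """Holds one compact latent textual representation per customized concept."""

    def __init__(self) -> None:
        self._bank: Dict[str, Features] = {}

    def add(self, name: str, features: Features) -> None:
        self._bank[name] = [row[:] for row in features]

    def get(self, name: str) -> Features:
        return self._bank[name]


def blend(base: Features, bank: ConceptBank,
          positions: Dict[str, int]) -> Features:
    """Return a copy of `base` with stored concept rows written in.

    `positions` maps a concept name to the index of the first token row
    that the concept occupies in the prompt (an assumed convention).
    """
    blended = [row[:] for row in base]
    for name, start in positions.items():
        feats = bank.get(name)
        if start < 0 or start + len(feats) > len(blended):
            raise ValueError(f"concept {name!r} does not fit at row {start}")
        for offset, row in enumerate(feats):
            blended[start + offset] = row[:]
    return blended


# Stand-in for text-encoder output: 3 tokens, 2 dimensions each.
text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
proj_weight = [[2.0, 0.0], [0.0, 3.0]]
base = linear_project(text_emb, proj_weight)

bank = ConceptBank()
bank.add("dog", [[9.0, 9.0]])  # each concept is customized and stored once
bank.add("cat", [[7.0, 7.0]])

# Combine both concepts at inference time without any further tuning.
blended = blend(base, bank, {"dog": 0, "cat": 2})
print(blended)
```

In this toy version, adding a concept only adds one entry to the bank, and combining concepts is a lookup plus a copy, which mirrors the scalability claim in spirit: per-concept work is done once, and blending adds no tuning step.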
