LaTexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending

Jian Jin, Zhenbo Yu, Yang Shen, Zhenyong Fu, Jian Yang

TL;DR

LaTexBlend addresses the challenge of scaling customized text-to-image generation to multiple concepts without sacrificing quality or efficiency. It introduces a latent textual space and a concept bank to store compact concept representations, enabling seamless on-the-fly blending of many concepts at inference time with no additional tuning. Empirical results show superior concept fidelity and layout coherence compared with strong baselines, along with linear tuning costs and no extra inference burden as the concept count grows. The approach enhances practical applicability of personalized T2I generation and supports layout-conditioned and complex scene generation, with potential for broader creative tooling.

Abstract

Customized text-to-image generation renders user-specified concepts into novel contexts based on textual prompts. Scaling the number of concepts in customized generation meets a broader demand for user creation, whereas existing methods face challenges with generation quality and computational efficiency. In this paper, we propose LaTexBlend, a novel framework for effectively and efficiently scaling multi-concept customized generation. The core idea of LaTexBlend is to represent single concepts and blend multiple concepts within a Latent Textual space, which is positioned after the text encoder and a linear projection. LaTexBlend customizes each concept individually, storing them in a concept bank with a compact representation of latent textual features that captures sufficient concept information to ensure high fidelity. At inference, concepts from the bank can be freely and seamlessly combined in the latent textual space, offering two key merits for multi-concept generation: 1) excellent scalability, and 2) significant reduction of denoising deviation, preserving coherent layouts. Extensive experiments demonstrate that LaTexBlend can flexibly integrate multiple customized concepts with harmonious structures and high subject fidelity, substantially outperforming baselines in both generation quality and computational efficiency. Our code will be publicly available.
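The abstract describes a pipeline in which text features pass through the text encoder and a linear projection, each concept is stored in a bank as a compact latent textual feature, and stored concepts are combined at inference. A minimal sketch of that data flow follows. Everything in it is an illustrative assumption rather than the authors' implementation: the toy `encode` function, the random `W_proj` matrix, the random vectors standing in for customized features, and in particular the blending rule (overwriting the rows of the chosen token positions), which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, LATENT_DIM = 8, 4

# Hypothetical stand-in for a text encoder: one fixed random vector per token.
# A real text encoder is contextual; this toy version is not.
_vocab = {}

def encode(tokens):
    for tok in tokens:
        if tok not in _vocab:
            _vocab[tok] = rng.normal(size=EMBED_DIM)
    return np.stack([_vocab[tok] for tok in tokens])

# Hypothetical linear projection applied after the text encoder.
W_proj = rng.normal(size=(EMBED_DIM, LATENT_DIM))

def to_latent(tokens):
    """Map tokens into the (toy) latent textual space: encoder, then projection."""
    return encode(tokens) @ W_proj  # shape: (len(tokens), LATENT_DIM)

# Concept bank: each concept is customized on its own and stored compactly.
# Random vectors stand in for the customized latent textual features.
concept_bank = {
    "V1* dog": rng.normal(size=LATENT_DIM),
    "V2* castle": rng.normal(size=LATENT_DIM),
}

def blend(tokens, placements):
    """Assumed blending rule: overwrite the latent rows at the given token
    positions with features looked up from the concept bank."""
    latent = to_latent(tokens)
    for index, concept_name in placements.items():
        latent[index] = concept_bank[concept_name]
    return latent

prompt = ["a", "dog", "in", "front", "of", "a", "castle"]
base = to_latent(prompt)
blended = blend(prompt, {1: "V1* dog", 6: "V2* castle"})
print(blended.shape)
```

In this sketch, adding a concept only adds one entry to `concept_bank`, and blending is a lookup plus a row assignment, which is one way to picture why combining stored concepts would need no further tuning.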

Paper Structure

This paper contains 35 sections, 11 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: LaTexBlend customizes each personal subject individually and stores it in a concept bank using a compact representation. At inference, multiple concepts from the bank can be combined seamlessly for multi-concept customized generation without requiring additional tuning. LaTexBlend simultaneously addresses two key challenges in scaling multi-concept generation: ensuring high generation quality (including concept fidelity and layout coherence) and maintaining computational efficiency.
  • Figure 2: Structure degradation and denoising deviation in customized generation. (a): The images in (0) are generated by the pre-trained model (with 3 different initializations) and are highly aligned with the prompt; images (1)–(3) are generated by Custom Diffusion kumari2023multi. In (1)–(3) of panel (a), we progressively add the customized concepts - "$\text{V}_1^*$ dog", "$\text{V}_2^*$ castle", and "$\text{V}_3^*$ sunglasses" - to the generation, using the same initializations as in (0). (b): We use the magnitude proposed in wen2024detecting to reflect the deviation in image structure. Customized generation deviates from the normal denoising process of the pre-trained model, resulting in degraded and memorized layouts (typically single-object-centric). This issue worsens as the number of concepts increases.
  • Figure 3: Overall framework of the proposed LaTexBlend. (a) LaTexBlend customizes each concept individually and stores them in a concept bank with a compact representation of latent textual features. (b) At inference, concepts from the bank can be seamlessly combined in the latent textual space on the fly for multi-concept generation, without needing any additional tuning.
  • Figure 4: Mitigation of image structure degradation. The images in (0) are generated by the pre-trained model with 3 different initializations. In images (1)-(3), we progressively blend customized concepts - "$\text{V}_1^*$ dog", "$\text{V}_2^*$ castle", and "$\text{V}_3^*$ sunglasses" - into the context of image (0). LaTexBlend effectively mitigates structure degradation caused by customized concepts, blending high-fidelity subject appearances while maintaining coherent layouts.
  • Figure 5: Comparison of fine-tuning costs. The fine-tuning cost of LaTexBlend increases linearly as the number of concepts grows. Mix-of-Show gu2024mix requires extra tuning for different concept combinations. Although OMG kong2024omg is also efficient for fine-tuning, its inference-time cost is high.
  • ...and 16 more figures
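
The Figure 5 caption contrasts tuning cost that grows linearly with the number of concepts against schemes that need extra tuning for different concept combinations. The short calculation below illustrates that contrast by counting tuning runs. It is a counting exercise under stated assumptions, not a measurement of any method: `per_concept_runs` assumes one run per concept, and `per_combination_runs` is a hypothetical worst case in which every combination of two or more concepts is requested and each needs its own extra run.

```python
from math import comb

def per_concept_runs(num_concepts):
    """One tuning run per concept: grows linearly."""
    return num_concepts

def per_combination_runs(num_concepts):
    """Hypothetical worst case: one extra run for every combination
    of two or more concepts, i.e. 2**n - n - 1 runs."""
    return sum(comb(num_concepts, k) for k in range(2, num_concepts + 1))

for n in (3, 5, 10):
    print(n, per_concept_runs(n), per_combination_runs(n))
# prints:
# 3 3 4
# 5 5 26
# 10 10 1013
```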