
$λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space

Maitreya Patel, Sangmin Jung, Chitta Baral, Yezhou Yang

TL;DR

This work tackles the high computational cost of multi-subject personalized text-to-image generation by proposing λ-ECLIPSE, a diffusion-free prior that operates in the CLIP latent space. It trains a compact 34M-parameter transformer on image-text interleaved data, optionally with Canny edge conditioning, and its output plugs into a frozen diffusion UNet for fast multi-subject generation. On the Dreambench, ConceptBed, and Multibench benchmarks, λ-ECLIPSE achieves competitive concept and composition alignment with far fewer parameters and GPU hours, and it uniquely supports multi-subject interpolation because it works in the CLIP latent space. The approach offers a practical, resource-efficient way to combine subject-driven image generation with CLIP-based priors, while also enabling edge-guided control and smooth concept blending.

Abstract

Despite the recent advances in personalized text-to-image (P-T2I) generative models, it remains challenging to perform finetuning-free multi-subject-driven T2I in a resource-efficient manner. Predominantly, contemporary approaches, involving the training of Hypernetworks and Multimodal Large Language Models (MLLMs), require heavy computing resources that range from 600 to 12300 GPU hours of training. These subject-driven T2I methods hinge on Latent Diffusion Models (LDMs), which facilitate T2I mapping through cross-attention layers. While LDMs offer distinct advantages, P-T2I methods' reliance on the latent space of these diffusion models significantly escalates resource demands, leading to inconsistent results and necessitating numerous iterations for a single desired image. In this paper, we present $λ$-ECLIPSE, an alternative prior-training strategy that works in the latent space of a pre-trained CLIP model without relying on the diffusion UNet models. $λ$-ECLIPSE leverages the image-text interleaved pre-training for fast and effective multi-subject-driven P-T2I. Through extensive experiments, we establish that $λ$-ECLIPSE surpasses existing baselines in composition alignment while preserving concept alignment performance, even with significantly lower resource utilization. $λ$-ECLIPSE performs multi-subject driven P-T2I with just 34M parameters and is trained on a mere 74 GPU hours. Additionally, $λ$-ECLIPSE demonstrates the unique ability to perform multi-concept interpolations.
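
The multi-concept interpolation mentioned above depends on the prior producing embeddings in the CLIP latent space, where two vectors can be blended before decoding. The following is a minimal sketch of that idea only, assuming plain linear interpolation between two embedding vectors. The helper name `lerp_embeddings`, the toy 4-dimensional vectors, and the choice of linear blending are illustrative assumptions, not the authors' procedure.

```python
from typing import List


def lerp_embeddings(a: List[float], b: List[float], t: float) -> List[float]:
    """Linearly blend two equal-length embedding vectors.

    t = 0.0 returns a, t = 1.0 returns b. Hypothetical helper for illustration.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


# Toy stand-ins for two concept embeddings (assumed values; real CLIP
# embeddings have far more dimensions).
concept_a = [1.0, 0.0, 2.0, -1.0]
concept_b = [0.0, 1.0, 0.0, 1.0]

# Sweep from concept A to concept B in five evenly spaced steps.
steps = [lerp_embeddings(concept_a, concept_b, i / 4) for i in range(5)]
```

In a full pipeline, each blended vector would stand in for the prior's output and be handed to the frozen diffusion decoder; that step is omitted here because it depends on the specific model weights.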

Paper Structure (38 sections, 4 equations, 14 figures, 9 tables)

This paper contains 38 sections, 4 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: $\lambda$-ECLIPSE can estimate subject-specific image embeddings while maintaining the balance between concept and composition alignment in a resource-efficient way.
  • Figure 2: This figure illustrates the three stages of the $\lambda$-ECLIPSE pipeline. 1) Create the image-text interleaved features using frozen CLIP. 2) Pre-train the $\lambda$-ECLIPSE (34M parameters) using Eq. \ref{eq:eclipse}, which ensures the mapping to the desired latent space given the image-text interleaved data. 3) During inference, the frozen Kandinsky v2.2 diffusion UNet model takes the output from the $\lambda$-ECLIPSE and generates the image.
  • Figure 3: CLIP(vision) features capture the semantics and fine-grained visual details. Each input is passed to Kandinsky v2.2 and regenerated by the decoder. (Top: real images; bottom: Canny edge maps)
  • Figure 4: This figure illustrates a qualitative comparison of $\lambda$-ECLIPSE with contemporary approaches for single-subject T2I generations, utilizing concepts and prompts from the Dreambench dataset. For each method, concept, and prompt, we generate four images and select the one that most accurately represents the queried concept and composition.
  • Figure 5: Qualitative results categorized by generative capabilities.
  • ...and 9 more figures
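
Stage 1 in the Figure 2 caption (creating image-text interleaved features) can be pictured with a small sketch. It assumes that interleaving means swapping the text feature at each subject token's position for that subject's image feature. The function `interleave_features`, the per-token toy vectors, and the dictionary keyed by subject word are all hypothetical stand-ins for frozen CLIP outputs, not the paper's actual implementation.

```python
from typing import Dict, List

Vector = List[float]


def interleave_features(
    tokens: List[str],
    text_features: List[Vector],
    subject_image_features: Dict[str, Vector],
) -> List[Vector]:
    """Replace the text feature of each subject token with its image feature.

    Tokens without an entry in subject_image_features keep their text feature.
    """
    if len(tokens) != len(text_features):
        raise ValueError("one text feature per token is required")
    return [
        subject_image_features.get(token, feature)
        for token, feature in zip(tokens, text_features)
    ]


# Toy prompt with one subject ("dog"); all vectors are made-up placeholders.
tokens = ["a", "dog", "on", "a", "beach"]
text_features = [[float(i), 0.0] for i in range(len(tokens))]
image_features = {"dog": [9.0, 9.0]}

interleaved = interleave_features(tokens, text_features, image_features)
```

The resulting sequence keeps the prompt's length and order, so a prior trained on it sees text context and subject appearance in a single input; supporting several subjects only requires more entries in the dictionary.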