Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting

Weili Zeng, Yichao Yan, Qi Zhu, Zhuo Chen, Pengzhi Chu, Weiming Zhao, Xiaokang Yang

TL;DR

This work addresses concept overfitting in text-to-image customization by delineating concept-agnostic and concept-specific overfitting and introducing two metrics, Latent Fisher divergence and 2-Wasserstein distance, to quantify them. It proposes Infusion, a lightweight method that preserves the original cross-attention while learning residual value embeddings, effectively injecting target concepts through a dual-stream, plug-and-play approach with only about 11 KB of learnable parameters. By decoupling the attention maps from value features and reusing foundational model maps during customization, Infusion achieves robust single- and multi-concept generation with improved text alignment and conceptual fidelity while maintaining diversity. Extensive experiments on SD-v1.5 show Infusion outperforms state-of-the-art baselines and demonstrates strong resistance to both overfitting types and modality collapse, indicating practical impact for flexible, efficient T2I customization.
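The decoupling described above can be pictured with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tensor shapes, and the choice to add the learned residual to the value vector of a single concept token are all hypothetical. The only ideas taken from the summary are that the attention map comes from the frozen foundational stream and that the trainable part is a residual on the value embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(query, text_emb, w_k):
    """softmax(Q K^T / sqrt(d)) of image queries against text tokens."""
    keys = text_emb @ w_k
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1)

def infused_cross_attention(query, base_text_emb, w_k, w_v, concept_idx, residual):
    """Hypothetical sketch of decoupled cross-attention.

    The attention map is computed from the frozen projections and the
    foundational prompt embedding, so it is untouched by customization.
    Only the value of the concept token receives a learned residual.
    """
    attn = attention_map(query, base_text_emb, w_k)   # frozen attention map
    values = (base_text_emb @ w_v).copy()             # frozen value features
    values[concept_idx] += residual                   # learned residual (assumed placement)
    return attn @ values

# Toy usage with made-up sizes: 5 image queries, 4 text tokens, width 8.
rng = np.random.default_rng(0)
query = rng.normal(size=(5, 8))
text_emb = rng.normal(size=(4, 8))
w_k = rng.normal(size=(8, 8))
w_v = rng.normal(size=(8, 8))
out = infused_cross_attention(query, text_emb, w_k, w_v, concept_idx=2,
                              residual=np.zeros(8))
```

With a zero residual the output coincides with ordinary cross-attention of the frozen model, which is one way to read the claim that the original generative capacity is preserved; a non-zero residual changes the output only in proportion to the attention each query already pays to the concept token.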

Abstract

Text-to-image (T2I) customization aims to create images that embody specific visual concepts delineated in textual descriptions. However, existing works still face a main challenge: concept overfitting. To tackle this challenge, we first analyze overfitting, categorizing it into concept-agnostic overfitting, which undermines non-customized concept knowledge, and concept-specific overfitting, which confines customization to limited modalities, i.e., backgrounds, layouts, and styles. To evaluate the degree of overfitting, we further introduce two metrics, the Latent Fisher divergence and the Wasserstein metric, to measure the distribution changes of non-customized and customized concepts, respectively. Drawing on this analysis, we propose Infusion, a T2I customization method that allows target concepts to be learned without being constrained by limited training modalities, while preserving non-customized knowledge. Infusion achieves this with remarkable efficiency, requiring a mere 11KB of trained parameters. Extensive experiments also demonstrate that our approach outperforms state-of-the-art methods in both single- and multi-concept customized generation.
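As a rough illustration of how a Wasserstein metric can quantify a distribution shift, the sketch below computes the 2-Wasserstein distance between two Gaussians using the standard closed form $W_2^2 = \lVert \mu_a - \mu_b \rVert^2 + \mathrm{Tr}\big(\Sigma_a + \Sigma_b - 2(\Sigma_a^{1/2} \Sigma_b \Sigma_a^{1/2})^{1/2}\big)$. Fitting Gaussians to the distributions is an assumption made here for simplicity; the abstract does not say which estimator the paper uses, and the helper names `psd_sqrt` and `gaussian_w2` are hypothetical.

```python
import numpy as np

def psd_sqrt(mat):
    """Square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_w2(mean_a, cov_a, mean_b, cov_b):
    """2-Wasserstein distance between N(mean_a, cov_a) and N(mean_b, cov_b)."""
    root_a = psd_sqrt(cov_a)
    cross = psd_sqrt(root_a @ cov_b @ root_a)
    squared = np.sum((mean_a - mean_b) ** 2) + np.trace(cov_a + cov_b - 2.0 * cross)
    return float(np.sqrt(max(squared, 0.0)))  # clamp round-off below zero

# Toy usage: two unit-covariance Gaussians whose means differ by (3, 4).
distance = gaussian_w2(np.zeros(2), np.eye(2), np.array([3.0, 4.0]), np.eye(2))
```

When the covariances match, the distance reduces to the Euclidean distance between the means; when the means match, it measures only the change in spread, which is the kind of change one would expect if a customized concept loses modality diversity.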

Paper Structure (19 sections, 9 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Example of concept-agnostic overfitting. The leftmost column shows the target concept; the remaining columns show non-customized generation results from different methods. The generated "cat" consistently exhibits black spotted stripes, while the generated "teddy" consistently takes a doll-like form, and both share similar backgrounds.
  • Figure 2: Example of concept-specific overfitting. Customizing with the prompt "a photo of a $\langle \text{cat} \rangle$", we find that all prior methods generate cat images with a size, pose, or background similar to the training data.
  • Figure 3: Infusion pipeline. (a) Infusion fully preserves the generative capacity of the original model, precluding concept-agnostic overfitting. (b) Infusion decouples the cross-attention module, replacing the attention maps in the customized pipeline with those from the foundational pipeline, thereby leveraging the modality diversity of the original model to mitigate concept-specific overfitting.
  • Figure 4: Concept-agnostic overfitting. We use a four-peak hybrid Gaussian distribution to represent the foundational model distribution and illustrate that customized tuning, as observed in DreamBooth and Custom Diffusion, undermines non-customized generative capabilities.
  • Figure 5: Concept-specific overfitting. We use a 25-peak hybrid Gaussian distribution to represent the modalities in the foundational model distribution under a super-class concept. Training that conflates the customized concept with the limited modalities of the training data gradually reduces the number of original modalities.
  • ...and 6 more figures