Controllable Textual Inversion for Personalized Text-to-Image Generation

Jianan Yang, Haobo Wang, Yanming Zhang, Ruixuan Xiao, Sai Wu, Gang Chen, Junbo Zhao

TL;DR

Controllable Textual Inversion (COTI) addresses the data inefficiency and brittleness of traditional Textual Inversion (TI) for personalized text-to-image (T2I) generation by jointly optimizing the new concept embedding and an actively expanded training set. It introduces a theoretically guided, MMSE-inspired framework with a dual scoring system (aesthetics and concept-matching) that selects high-quality data from web sources, and a dynamic training schedule for the embeddings. Empirically, COTI achieves substantial gains in FID and R-precision over TI baselines and demonstrates robust, cycle-wise refinement of concept representations without manual labeling. The approach offers a practical path to deploying personalized T2I at scale, with clear guidance for integrating automatic data acquisition and adaptive optimization.
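
The dual scoring idea can be illustrated with a small sketch. It assumes each candidate web image already has an aesthetic score and a concept-matching score in [0, 1], and it combines them with a convex weight `alpha`. The function names, the weight, and the top-k selection rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def weighted_scores(aesthetic, concept, alpha=0.5):
    """Convex combination of two per-image scores (alpha is an assumed weight)."""
    aesthetic = np.asarray(aesthetic, dtype=float)
    concept = np.asarray(concept, dtype=float)
    if aesthetic.shape != concept.shape:
        raise ValueError("score arrays must have the same shape")
    return alpha * aesthetic + (1.0 - alpha) * concept

def select_top_k(scores, k, already_selected=()):
    """Return indices of the k highest-scoring candidates not yet selected."""
    excluded = set(already_selected)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    picked = [int(i) for i in order if int(i) not in excluded]
    return picked[:k]

# Hypothetical scores for four candidate images.
aesthetic = [0.9, 0.2, 0.6, 0.8]
concept = [0.1, 0.9, 0.7, 0.8]
scores = weighted_scores(aesthetic, concept, alpha=0.5)
chosen = select_top_k(scores, k=2)
print(chosen)
```

An image that is beautiful but off-concept (index 0) or on-concept but low quality (index 1) loses to images that do reasonably well on both criteria, which is the point of combining the two scores.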

Abstract

Recent large-scale generative models have attained unprecedented performance, especially in producing high-fidelity images driven by text prompts. Textual inversion (TI), used alongside text-to-image model backbones, has been proposed as an effective technique for personalizing generation when prompts contain user-defined, unseen, or long-tail concept tokens. Despite that, we find and show that the deployment of TI remains full of "dark magic": to name a few issues, the harsh requirement of additional datasets, arduous human effort in the loop, and a lack of robustness. In this work, we propose a much-enhanced version of TI, dubbed Controllable Textual Inversion (COTI), which resolves all the aforementioned problems and in turn delivers a robust, data-efficient, and easy-to-use framework. The core of COTI is a theoretically guided loss objective instantiated with a comprehensive and novel weighted scoring mechanism, encapsulated in an active-learning paradigm. Extensive results show that COTI significantly outperforms prior TI-related approaches, with a 26.05 decrease in FID score and a 23.00% boost in R-precision.
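
For readers unfamiliar with R-precision: in text-to-image evaluation it is commonly computed by checking, for each generated image, whether its true prompt ranks first among a pool of candidate prompts by embedding similarity. The sketch below shows that computation on precomputed embeddings. The function name, the cosine-similarity choice, and the toy vectors are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def r_precision(image_emb, text_emb, true_idx):
    """Fraction of images whose true text is the top-1 match by cosine similarity.

    image_emb: (n_images, d), text_emb: (n_texts, d), true_idx: (n_images,)
    """
    img = np.asarray(image_emb, dtype=float)
    txt = np.asarray(text_emb, dtype=float)
    # Normalize rows so the dot product equals cosine similarity.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sims = img @ txt.T
    top1 = sims.argmax(axis=1)
    return float(np.mean(top1 == np.asarray(true_idx)))

# Toy example: two candidate prompts, three generated images.
texts = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
true_prompt = [0, 1, 1]  # the third image retrieves the wrong prompt
rp = r_precision(images, texts, true_prompt)
print(rp)
```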

Paper Structure

This paper contains 25 sections, 2 theorems, 12 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.2

Let $L_{LDM}$ be the LDM loss defined in the paper's LDM loss equation (label `eq:ldm_loss`), and suppose the image space lies within $\mathbb{R}^n$. Then there always exist an ideal random vector $\widehat{\mathbf{X}}_G\in\mathbb{R}^n$ and a condition vector $v^*$ within the text embedding space for LDM, such that

Figures (6)

  • Figure 1: Comparison of generated images on different versions of personalized text-to-image generation. This figure shows a comparison of (a) images generated without textual inversion embedding, (b) images generated with TI trained on 100 randomly selected web data, (c) images generated with TI trained on 1000 web data, (d) images generated with COTI with 100 automatically-selected data. The experiments are conducted on the publicly available Stable Diffusion 2.0 with the concepts “emperor penguin chicks” and “axolotl”.
  • Figure 2: Left: Vanilla TI learns from a high-quality but small image set $\mathbf{X}_T$. Right: COTI alternates between data selection and scheduled embedding training. In each training cycle, COTI calculates a weighted score based on the current state of the text embedding and then selects new samples to expand $\mathbf{X}_T$. Afterward, the text embedding is trained via a dynamically scheduled procedure. These two steps alternate until convergence. We show how to apply COTI to a diffusion-based text-to-image model.
  • Figure 3: A comparison of different frameworks. Specifically, we compare three lines of work: (1) TI (RAND), in which the embedding is trained with TI on 200 randomly selected samples; (2) TI trained on all data (1000 samples) in the dataset; (3) COTI trained on carefully selected data with a budget of 200. Embeddings obtained from randomly sampled data fail to produce the features of the newly given concepts, while those trained on the full data capture the necessary details but inevitably include disruptive ones.
  • Figure 4: Generative results as cycles proceed. Samples are generated with COTI on cycles 1 to 7. In each cycle, we select and add 10 high-quality samples. Generated samples start to converge and contain the right details of the original concept after cycle 4 or 5. The generated results also contain diverse background content despite the few images given.
  • Figure 5: How the scores change on COTI as cycles proceed for emperor penguin (chick). Panels (a) and (b) show how the aesthetic, concept-matching, and comprehensive scores change across cycles on COTI with and without a dynamic schedule, respectively.
  • ...and 1 more figure
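
The alternating procedure described in the Figure 2 caption (score candidates against the current embedding, expand the training set, then train the embedding) can be sketched as a toy loop. This is only an illustration under strong assumptions: images are stood in for by 2-D feature vectors, concept matching is cosine similarity to the current embedding, training minimizes a mean-squared-error toy loss rather than the LDM loss, and the step-size decay is a placeholder for the paper's dynamic schedule. All names and parameters here are hypothetical.

```python
import numpy as np

def cosine01(features, v):
    """Cosine similarity between each row of features and v, mapped to [0, 1]."""
    num = features @ v
    den = np.linalg.norm(features, axis=1) * np.linalg.norm(v) + 1e-12
    return 0.5 * (num / den + 1.0)

def run_cycles(features, aesthetic, v0, per_cycle=2, max_cycles=5,
               alpha=0.5, lr=0.5, steps=20, tol=1e-3):
    """Alternate data selection and embedding training (toy stand-in for COTI)."""
    features = np.asarray(features, dtype=float)
    aesthetic = np.asarray(aesthetic, dtype=float)
    v = np.asarray(v0, dtype=float).copy()
    selected = []
    for cycle in range(max_cycles):
        # 1) Score candidates using the current state of the embedding.
        score = alpha * aesthetic + (1 - alpha) * cosine01(features, v)
        if selected:
            score[selected] = -np.inf  # never re-select an image
        order = np.argsort(-score, kind="stable")[:per_cycle]
        new = [int(i) for i in order if np.isfinite(score[i])]
        if not new:
            break  # candidate pool exhausted
        selected.extend(new)
        # 2) Train the embedding on the expanded set (MSE toy loss, not the LDM loss).
        target = features[selected]
        v_prev = v.copy()
        step = lr / (1 + cycle)  # assumed schedule: smaller steps in later cycles
        for _ in range(steps):
            grad = 2 * (v - target).mean(axis=0)
            v -= step * grad
        if np.linalg.norm(v - v_prev) < tol:
            break  # embedding has stopped moving
    return v, selected

# Hypothetical pool: indices 0, 1, 2, 5 are on-concept; 3 and 4 are off-concept.
features = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [-1.0, 0.0], [1.0, 0.1]]
aesthetic = [0.8, 0.7, 0.9, 0.9, 0.9, 0.6]
v, selected = run_cycles(features, aesthetic, v0=[1.0, 0.2], per_cycle=2, max_cycles=2)
print(selected, v)
```

With a budget of two cycles of two images each, the off-concept candidates are never selected even though their aesthetic scores are high, because the concept-matching term is recomputed from the embedding at the start of every cycle.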

Theorems & Definitions (3)

  • Definition 3.1
  • Theorem 3.2
  • Theorem 3.3