Controllable Textual Inversion for Personalized Text-to-Image Generation
Jianan Yang, Haobo Wang, Yanming Zhang, Ruixuan Xiao, Sai Wu, Gang Chen, Junbo Zhao
TL;DR
Controllable Textual Inversion (COTI) addresses the data inefficiency and brittleness of traditional Textual Inversion (TI) for personalized text-to-image (T2I) generation by jointly optimizing the new concept embedding and an actively expanded training set. It introduces a theoretically guided, MMSE-inspired framework with a dual scoring system (aesthetics and concept matching) and a dynamic training schedule, which together select high-quality data from web sources and train the embedding. Empirically, COTI achieves substantial gains in FID and R-precision over TI baselines and demonstrates robust, cycle-wise refinement of concept representations without manual labeling. The approach offers a practical path to deploying personalized T2I generation at scale, with clear guidance for integrating automatic data acquisition and adaptive optimization.
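The selection idea summarized above (score each candidate image on aesthetics and concept matching, then expand the training set cycle by cycle) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Candidate` class, the linear `weight` blend, the fixed per-image scores, and the `active_expand` loop are hypothetical and are not taken from the paper, whose scoring is presumably tied to the current embedding and re-evaluated each cycle.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A web image with two hypothetical, precomputed scores in [0, 1]."""
    name: str
    aesthetic: float       # assumed aesthetics score
    concept_match: float   # assumed concept-matching score


def combined_score(c: Candidate, weight: float = 0.5) -> float:
    # Assumed linear blend of the two scores; the paper's weighting may differ.
    return weight * c.aesthetic + (1.0 - weight) * c.concept_match


def select_top_k(pool, k, weight=0.5):
    # Rank the pool by combined score, highest first, and keep the best k.
    ranked = sorted(pool, key=lambda c: combined_score(c, weight), reverse=True)
    return ranked[:k]


def active_expand(train_set, pool, cycles, k, weight=0.5):
    """Move the k best-scoring candidates into the training set each cycle."""
    train = list(train_set)
    remaining = list(pool)
    for _ in range(cycles):
        picked = select_top_k(remaining, k, weight)
        if not picked:
            break
        train.extend(picked)
        remaining = [c for c in remaining if c not in picked]
        # Placeholder: the concept embedding would be retrained on `train` here.
    return train, remaining


pool = [
    Candidate("a", 0.9, 0.2),
    Candidate("b", 0.6, 0.8),
    Candidate("c", 0.1, 0.9),
    Candidate("d", 0.3, 0.3),
]
train, remaining = active_expand([], pool, cycles=2, k=1)
print([c.name for c in train], [c.name for c in remaining])
```

With equal weights, an image that is strong on only one score (such as "a" or "c") ranks below one that is reasonably good on both ("b"), which is the motivation for blending two scores rather than filtering on one.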
Abstract
Recent large-scale generative modeling has attained unprecedented performance, especially in producing high-fidelity images driven by text prompts. Textual inversion (TI), used alongside text-to-image model backbones, has been proposed as an effective technique for personalizing generation when the prompts contain user-defined, unseen, or long-tail concept tokens. Despite that, we find and show that the deployment of TI remains full of "dark magic": to name a few issues, a harsh requirement for additional datasets, arduous human effort in the loop, and a lack of robustness. In this work, we propose a much-enhanced version of TI, dubbed Controllable Textual Inversion (COTI), which resolves all the aforementioned problems and in turn delivers a robust, data-efficient, and easy-to-use framework. The core of COTI is a theoretically guided loss objective instantiated with a comprehensive and novel weighted scoring mechanism, encapsulated by an active-learning paradigm. Extensive results show that COTI significantly outperforms prior TI-related approaches, with a 26.05 decrease in the FID score and a 23.00% boost in R-precision.
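For readers unfamiliar with the R-precision metric reported above, the following is a minimal sketch of its general retrieval definition: given a query with R relevant candidates, it is the fraction of relevant items among the top R retrieved. The cosine-similarity ranking, the toy vectors, and the function names are assumptions for illustration; the exact evaluation protocol used for COTI (encoders, number of distractors, value of R) is not specified in this section.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def r_precision(query, candidates, relevant):
    """Fraction of relevant candidates among the top R, with R = len(relevant).

    query      : embedding of the query (e.g. a generated image)
    candidates : list of candidate embeddings (e.g. text prompts)
    relevant   : set of indices into `candidates` that truly match the query
    """
    r = len(relevant)
    if r == 0:
        raise ValueError("at least one relevant candidate is required")
    # Rank candidate indices by similarity to the query, most similar first.
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    hits = sum(1 for i in ranked[:r] if i in relevant)
    return hits / r


# Toy 2-D embeddings, purely illustrative.
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2], [-1.0, 0.0]]
print(r_precision(query, candidates, relevant={0, 2}))
print(r_precision(query, candidates, relevant={0, 1}))
```

In the first call both relevant candidates rank in the top two, so the score is perfect; in the second, one relevant candidate is displaced by a distractor, which halves the score.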
