Training-free Editioning of Text-to-Image Models

Jinqi Wang; Yunfei Fu; Zhangcan Ding; Bailin Deng; Yu-Kun Lai; Yipeng Qin

TL;DR

The paper addresses how to create customized editions of a text-to-image generator without retraining, by representing each edition concept as a PCA-derived subspace of the CLIP text-embedding space. In this training-free editioning framework, input prompts are projected into a concept subspace $S_{T(C)}$, yielding editioned outputs $I = M(p_T|C)$ that are constrained to the target concept. In the reported experiments, the method achieves high editioning accuracy and maintains image quality comparable to the base diffusion model, and the analysis offers insights into the semantic structure of CLIP subspaces and their interpolation behavior. The work also highlights business implications, including new forms of product differentiation and pricing strategies for personalized T2I services, enabled by efficient, training-free customization.

Abstract

Inspired by the software industry's practice of offering different editions or versions of a product tailored to specific user groups or use cases, we propose a novel task, namely, training-free editioning, for text-to-image models. Specifically, we aim to create variations of a base text-to-image model without retraining, enabling the model to cater to the diverse needs of different user groups or to offer distinct features and functionalities. To achieve this, we propose that different editions of a given text-to-image model can be formulated as concept subspaces in the latent space of its text encoder (e.g., CLIP). In such a concept subspace, all points satisfy a specific user need (e.g., generating images of a cat lying on the grass/ground/falling leaves). Technically, we apply Principal Component Analysis (PCA) to obtain the desired concept subspaces from representative text embeddings that correspond to a specific user need or requirement. Projecting the text embedding of a given prompt into these low-dimensional subspaces enables efficient model editioning without retraining. Intuitively, our proposed editioning paradigm enables a service provider to customize the base model into its "cat edition" (or other editions) that restricts image generation to cats, regardless of the user's prompt (e.g., dogs or people). This introduces a new dimension for product differentiation, targeted functionality, and pricing strategies, unlocking novel business models for text-to-image generators. Extensive experimental results demonstrate the validity of our approach and its potential to enable a wide range of customized text-to-image model editions across various domains and applications.
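
The core idea in the abstract (fit PCA to representative text embeddings, then project a prompt embedding into the resulting subspace) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `fit_concept_subspace` and `project_to_subspace` are hypothetical, random vectors stand in for flattened CLIP text embeddings, and centering on the concept mean (an affine subspace) is an assumption about how the PCA is set up.

```python
import numpy as np


def fit_concept_subspace(embeddings, k):
    """Fit a k-dimensional PCA subspace to concept embeddings of shape (n, d).

    Hypothetical helper: returns the mean embedding and the top-k principal
    directions as the rows of an orthonormal (k, d) basis.
    """
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]


def project_to_subspace(x, mean, basis):
    """Orthogonally project x (shape (d,)) onto mean + span(basis)."""
    coords = (x - mean) @ basis.T  # coordinates in the subspace, shape (k,)
    return mean + coords @ basis   # back to the original d-dim space


rng = np.random.default_rng(0)
# Assumption: stand-ins for text embeddings of prompts describing one concept.
concept_embeddings = rng.normal(size=(50, 16))
mean, basis = fit_concept_subspace(concept_embeddings, k=4)

# Assumption: stand-in for the embedding of an arbitrary user prompt.
prompt_embedding = rng.normal(size=16)
editioned_embedding = project_to_subspace(prompt_embedding, mean, basis)
```

In an actual pipeline, the projected embedding would be passed to the diffusion model in place of the original prompt embedding; how the embeddings are flattened and how `k` is chosen are left open here.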

Paper Structure (21 sections, 8 equations, 13 figures, 7 tables)

Introduction
Related Work
Task Definition
Differences with Image Editing and Concept Erasing
Method
CLIP-based Concept Subspace Projection
Efficient Computation
Experiments
Experimental Setup
Effectiveness of Concept Subspace Projection for Text-to-Image Model Editioning
Quantitative Results
Qualitative Results
Properties of CLIP Concept Subspaces
Choice of k for Each Concept Subspace
Conclusion
...and 6 more sections

Figures (13)

Figure 1: Illustration of Text-to-Image Model Editioning. Our method can create variations (e.g., Boy Edition, Cat Edition) of a base text-to-image model without retraining, enabling them to cater to the diverse needs of different user groups or to offer distinct features and functionalities.
Figure 2: Overview of our concept subspace creation (top) and projection (bottom).
Figure 3: 13k components yield a cumulative explained variance ratio of 99.9+%.
Figure 4: Images generated by different prompts when using different editions of the Stable Diffusion v1.4 model. The input prompts are shown at the top.
Figure 5: Different images generated by the same prompt when using different editions of the Stable Diffusion v1.4 model. The input prompts are shown at the bottom.
...and 8 more figures

Theorems & Definitions (3)

Definition 1: General
Definition 2: Special
Conjecture 1
