TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder

NaHyeon Park, Kunhee Kim, Hyunjung Shim

TL;DR

This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. It does so with a selective fine-tuning strategy that focuses on the text encoder.

Abstract

Recent breakthroughs in text-to-image models have opened up promising research avenues in personalized image generation, enabling users to create diverse images of a specific subject using natural language prompts. However, existing methods often suffer from performance degradation when given only a single reference image. They tend to overfit the input, producing highly similar outputs regardless of the text prompt. This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. Specifically, we propose a selective fine-tuning strategy that focuses on the text encoder. Furthermore, we introduce three key techniques to enhance personalization performance: (1) augmentation tokens to encourage feature disentanglement and alleviate overfitting, (2) a knowledge-preservation loss to reduce language drift and promote generalizability across diverse prompts, and (3) SNR-weighted sampling for efficient training. Extensive experiments demonstrate that our approach efficiently generates high-quality, diverse images using only a single reference image while significantly reducing memory and storage requirements.
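As a rough illustration of technique (3), the sketch below draws training timesteps with probabilities that depend on the signal-to-noise ratio (SNR) of a DDPM-style noise schedule. The abstract does not state the weighting function, so everything here is an assumption: the linear beta schedule, the hypothetical helpers `alpha_bar_schedule`, `snr_weights`, and `sample_timesteps`, and the weight 1 / (1 + SNR), which was chosen only because it favors noisier timesteps. It should not be read as the authors' formulation.

```python
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def snr_weights(alpha_bars):
    """Unnormalized weights 1 / (1 + SNR(t)), with SNR(t) = a_t / (1 - a_t).

    Hypothetical weighting chosen for illustration. Algebraically it equals
    1 - a_t, so low-SNR (noisier) timesteps are drawn more often.
    """
    weights = []
    for a in alpha_bars:
        snr = a / (1.0 - a)
        weights.append(1.0 / (1.0 + snr))
    return weights


def sample_timesteps(weights, batch_size, rng):
    """Draw timestep indices with probability proportional to `weights`."""
    return rng.choices(range(len(weights)), weights=weights, k=batch_size)


alpha_bars = alpha_bar_schedule()
weights = snr_weights(alpha_bars)
rng = random.Random(0)  # fixed seed so the draw is reproducible
timesteps = sample_timesteps(weights, batch_size=8, rng=rng)
```

In a training loop, the sampled `timesteps` would replace a uniform draw over the schedule. Swapping in a different weighting only requires changing `snr_weights`.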

Paper Structure (39 sections, 8 equations, 16 figures, 6 tables)

Figures (16)

  • Figure 1: Change in weights of different layers during fine-tuning. The mean weight change of text encoder layers is relatively greater than that of U-Net parameters.
  • Figure 2: Method overview. We selectively fine-tune the text encoder for one-shot personalization. We utilize three novel techniques to further boost the personalization performance.
  • Figure 3: Qualitative comparison on Stable Diffusion v1.5. We compare images generated by each method using various types of text prompts on different subjects. All models are trained using a single reference image.
  • Figure 4: Diversity comparison. (a) We calculate the inter-similarity of 100 generated images using the DINOv2 score and plot the distribution, given the same reference image and identical prompts. Blue and red horizontal lines indicate the median and mean of each distribution, respectively. (b) Qualitative examples of each method, with two subjects, each with two images per prompt. Note that for a fair comparison, the random seeds are fixed.
  • Figure 5: Comparison of generated attention maps. We compare cross-attention maps of Custom Diffusion and our method. Our approach successfully disentangles subject-relevant information from irrelevant details.
  • ...and 11 more figures
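
The inter-similarity measurement described for Figure 4(a) can be sketched as cosine similarity over all pairs of image embeddings, summarized by the mean and median; higher values indicate less diverse outputs. This is a minimal sketch under stated assumptions: the embeddings would come from a DINOv2 feature extractor (not shown here), the toy vectors are placeholders, and `cosine_similarity` and `pairwise_similarities` are hypothetical helpers rather than the authors' evaluation code.

```python
import math
import statistics
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Similarity for every unordered pair of embeddings."""
    return [cosine_similarity(u, v) for u, v in combinations(embeddings, 2)]


# Placeholder embeddings standing in for image features (assumption).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
sims = pairwise_similarities(toy_embeddings)
summary = {"mean": statistics.mean(sims), "median": statistics.median(sims)}
```

With 100 generated images, `pairwise_similarities` would return 4,950 values, whose distribution, mean, and median correspond to what the caption describes plotting.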