InstructBooth: Instruction-following Personalized Text-to-Image Generation

Daewon Chae, Nokyung Park, Jinkyu Kim, Kimin Lee

TL;DR

InstructBooth tackles the challenge of generating personalized text-to-image outputs that preserve a user-specific subject while remaining faithful to the text prompt. It couples a DreamBooth-like personalization stage, which uses a unique subject identifier, with a subsequent reinforcement learning (RL) fine-tuning stage that maximizes a text-alignment reward, mitigating overfitting and expanding contextual diversity. The approach introduces detailed subject descriptions for rare subjects and uses prompts both with and without identifiers to stabilize RL training. It achieves superior text fidelity and competitive subject fidelity compared to baselines, as confirmed by human judgments and the DreamBench benchmark. This two-stage, reward-driven framework makes personalized T2I models more practical for expressive, context-rich generation, and the work also discusses safe deployment and future improvements in evaluation datasets and watermarking strategies.

Abstract

Personalizing text-to-image models using a limited set of images for a specific object has been explored in subject-specific image generation. However, existing methods often face challenges in aligning with text prompts due to overfitting to the limited training images. In this work, we introduce InstructBooth, a novel method designed to enhance image-text alignment in personalized text-to-image models without sacrificing the personalization ability. Our approach first personalizes text-to-image models with a small number of subject-specific images using a unique identifier. After personalization, we fine-tune personalized text-to-image models using reinforcement learning to maximize a reward that quantifies image-text alignment. Additionally, we propose complementary techniques to increase the synergy between these two processes. Our method demonstrates superior image-text alignment compared to existing baselines, while maintaining high personalization ability. In human evaluations, InstructBooth outperforms the baselines when all factors are considered together. Our project page is at https://sites.google.com/view/instructbooth.
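
The second stage described above, fine-tuning to maximize a reward, can be illustrated with a toy policy-gradient (REINFORCE) loop. This is only a sketch under strong assumptions and not the authors' implementation: a four-way categorical policy stands in for the personalized diffusion model, the values in `TOY_REWARDS` are made-up stand-ins for an image-text alignment score, and the names `reinforce_finetune`, `TOY_REWARDS`, `initial`, and `tuned` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_finetune(logits, reward_fn, steps=2000, lr=0.1, batch=8, seed=0):
    """Toy REINFORCE: nudge a categorical policy toward high-reward outputs."""
    rng = random.Random(seed)
    logits = list(logits)
    n = len(logits)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a batch of "generations" from the current policy.
        samples = rng.choices(range(n), weights=probs, k=batch)
        rewards = [reward_fn(a) for a in samples]
        # Batch-mean baseline reduces the variance of the gradient estimate.
        baseline = sum(rewards) / batch
        grad = [0.0] * n
        for a, r in zip(samples, rewards):
            advantage = r - baseline
            for i in range(n):
                # d log p(a) / d logit_i = 1[i == a] - p_i
                grad[i] += advantage * ((1.0 if i == a else 0.0) - probs[i])
        # Gradient ascent on the expected reward.
        logits = [l + lr * g / batch for l, g in zip(logits, grad)]
    return logits


# Assumption: made-up alignment scores for four candidate outputs.
TOY_REWARDS = [0.1, 0.2, 0.9, 0.3]
# Assumption: an "overfit" starting policy that favors the low-reward output 0.
initial = [2.0, 0.0, 0.0, 0.0]
tuned = reinforce_finetune(initial, lambda a: TOY_REWARDS[a])
```

In this toy setting the update shifts probability mass from the initially favored, poorly aligned output toward the output with the highest assumed reward, which mirrors the intent of the RL stage at a very small scale.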

Paper Structure (34 sections, 6 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 2: Comparison of images generated by DreamBooth, Custom Diffusion, and InstructBooth, given a few images of a specific object (left) and a text prompt.
  • Figure 3: An overview of InstructBooth. Given a user-specific subject (e.g., a cat) from a few input images, our method enables the personalized text-to-image model to generate new images of that subject in various contexts (e.g., "a cat is cooking a gourmet meal"). Our method consists of two main steps: (left) personalization with a few images of the subject, where a pre-trained text-to-image model is fine-tuned with a unique identifier, and (right) RL fine-tuning for improving image-text alignment, where we further fine-tune the personalized model to maximize a reward that quantifies image-text alignment.
  • Figure 4: Qualitative comparison against DreamBooth, Custom Diffusion, NeTI, and Textual Inversion. Given a few images of a unique subject (e.g., cat, teddy bear, and pot) and a text prompt, models are required to generate personalized images that align with the prompt. [*] denotes a unique identifier. Please see the Appendix for more diverse samples.
  • Figure 5: Human evaluation results between InstructBooth and baselines. Given two images generated by each model, we ask human raters to indicate which is better in overall quality. The results show the preference rates aggregated via majority voting over seven independent human raters.
  • Figure 6: Samples generated by InstructBooth on unseen text prompts. Our method generates personalized images with high image-text alignment. [*] denotes a unique identifier. We also provide a comparison with other methods in the Appendix.
  • ...and 11 more figures