
Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance

Kelvin C. K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, Huisheng Wang

TL;DR

Subject-driven text-to-image synthesis often overfits to subject-specific cues embedded in learnable tokens or encoders, so attributes specified in the text prompt are underrepresented. SAG introduces a subject-agnostic condition and dual classifier-free guidance (DCFG) to suppress subject cues in early iterations and reintroduce them later, improving alignment with both the target subject and the text prompt. The approach is simple to implement and applies to optimization-based and encoder-based personalization, as well as DreamBooth-based fine-tuning, with consistent improvements demonstrated on ELITE, Textual Inversion, SuTI, and DreamSuTI in quantitative metrics and user studies. This yields more faithful, controllable, and diverse subject-driven image synthesis without retraining or architectural changes to existing diffusion pipelines.

Abstract

In subject-driven text-to-image synthesis, the synthesis process tends to be heavily influenced by the reference images provided by users, often overlooking crucial attributes detailed in the text prompt. In this work, we propose Subject-Agnostic Guidance (SAG), a simple yet effective solution to remedy the problem. We show that through constructing a subject-agnostic condition and applying our proposed dual classifier-free guidance, one could obtain outputs consistent with both the given subject and input text prompts. We validate the efficacy of our approach through both optimization-based and encoder-based methods. Additionally, we demonstrate its applicability in second-order customization methods, where an encoder-based model is fine-tuned with DreamBooth. Our approach is conceptually simple and requires only minimal code modifications, but leads to substantial quality improvements, as evidenced by our evaluations and user studies.
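
The guidance step described above can be illustrated with a small numerical sketch. The paper's exact equations are not reproduced in this summary, so the form below is only one plausible reading: a weak CFG step interpolates from the subject-agnostic prediction toward the subject-aware one with a variable weight, and a null CFG step then applies ordinary classifier-free guidance with a constant weight. The function names, the step-shaped schedule, and the values `cutoff=0.3` and `w_null=7.5` are assumptions made for illustration, and random arrays stand in for the denoiser outputs.

```python
import numpy as np


def weak_weight(step, num_steps, w_late=1.0, cutoff=0.3):
    """Hypothetical schedule (an assumption, not the paper's): the subject
    is ignored for the first `cutoff` fraction of steps, then restored."""
    return 0.0 if step < cutoff * num_steps else w_late


def dual_cfg(eps_null, eps_agnostic, eps_subject, w_null, w_weak):
    """Combine three noise predictions under the assumed dual-CFG form.

    eps_null:     prediction for the empty (null) condition
    eps_agnostic: prediction for the subject-agnostic condition
    eps_subject:  prediction for the subject-aware condition
    """
    # Weak CFG: move from the subject-agnostic prediction toward the
    # subject-aware one. With w_weak == 0 the subject has no influence.
    eps_weak = eps_agnostic + w_weak * (eps_subject - eps_agnostic)
    # Null CFG: standard classifier-free guidance with a constant weight.
    return eps_null + w_null * (eps_weak - eps_null)


# Toy stand-ins for denoiser outputs at one timestep.
rng = np.random.default_rng(0)
eps_null, eps_agnostic, eps_subject = rng.normal(size=(3, 4))

for step in (0, 25, 49):
    w = weak_weight(step, num_steps=50)
    print(step, w, dual_cfg(eps_null, eps_agnostic, eps_subject, 7.5, w))
```

Under this reading, a weak weight of 0 reduces the update to plain classifier-free guidance on the subject-agnostic condition, and a weak weight of 1 reduces it to plain classifier-free guidance on the subject-aware condition, which matches the described behavior of suppressing the subject early and reintroducing it later.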

Paper Structure (22 sections, 6 equations, 16 figures, 2 tables)

This paper contains 22 sections, 6 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Addressing Content Ignorance. Given user-provided subject images, part of the content specified in the text prompt (highlighted in blue) is overlooked. Our Subject-Agnostic Guidance (SAG) aligns the output more closely with both the target subject and text prompt. Here S$^*$ denotes a pseudo-word, with its text embedding replaced by a learnable subject embedding.
  • Figure 2: Overview of SAG. Given a subject-aware embedding, we first construct a subject-agnostic embedding. These embeddings are subsequently used in our dual classifier-free guidance (DCFG), which consists of weak classifier-free guidance and null classifier-free guidance. Null CFG adopts a constant weight (Eqn. \ref{eq:cfg}) and Weak CFG adopts a variable weight (Eqn. \ref{eq:DCFG}).
  • Figure 3: SAG on ELITE \cite{wei2023elite}. Our ELITE-SAG produces outputs that are more faithful to text prompts while still preserving subject identity. For Stable Diffusion, we generate pure text-to-image results by substituting "S$^*$" with "A dog" or "A cat".
  • Figure 4: SAG on Textual Inversion \cite{gal2023image}. Our SAG improves text alignment without sacrificing the identity of the subject.
  • Figure 5: SAG on SuTI \cite{chen2023subject}. When applying SAG on SuTI, the subject is discarded during initial iterations, yielding outputs with markedly improved text alignment. Reference images are not provided to protect privacy.
  • ...and 11 more figures
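
The captions for Figures 1 and 2 mention a pseudo-word S$^*$ whose text embedding is replaced by a learnable subject embedding, and a subject-agnostic embedding constructed from the subject-aware one. The captions do not say how that construction is done, so the sketch below shows only one possibility for a token-based method: swapping the embedding at the pseudo-word position for the embedding of a generic word. The function name `make_subject_agnostic`, the array shapes, and the choice of a generic-word embedding are all assumptions for illustration.

```python
import numpy as np


def make_subject_agnostic(subject_aware, pseudo_index, generic_embedding):
    """Return a copy of the prompt embeddings with the pseudo-word slot
    replaced by a generic embedding (hypothetical construction).

    subject_aware:     (seq_len, dim) token embeddings of the prompt,
                       where row `pseudo_index` holds the learned S* embedding
    generic_embedding: (dim,) embedding of a generic word, e.g. a class noun
    """
    agnostic = subject_aware.copy()  # leave the subject-aware input untouched
    agnostic[pseudo_index] = generic_embedding
    return agnostic


# Toy example: a 4-token prompt with 3-dimensional embeddings, S* at index 2.
subject_aware = np.arange(12, dtype=float).reshape(4, 3)
generic = np.array([-1.0, -1.0, -1.0])
subject_agnostic = make_subject_agnostic(subject_aware, 2, generic)
print(subject_agnostic)
```

Both embeddings would then be passed to the denoiser to obtain the subject-aware and subject-agnostic predictions used by the guidance step. Encoder-based methods such as ELITE or SuTI may require a different construction, which this sketch does not cover.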