DreamTuner: Single Image is Enough for Subject-Driven Generation

Miao Hua, Jiawei Liu, Fei Ding, Wei Liu, Jie Wu, Qian He

TL;DR

DreamTuner addresses the challenge of producing high-fidelity, subject-preserving images from a single reference image in diffusion-based text-to-image generation. It introduces a subject-encoder for coarse identity preservation and a self-subject-attention mechanism to refine subject details, with a training-free inference option and an optional fine-tuning stage. The method decouples content and layout with a frozen ControlNet and incorporates a trainable [S*] embedding to represent the subject, achieving strong subject fidelity while maintaining generation flexibility. Empirical results on natural and anime subjects show that DreamTuner outperforms existing fine-tuning and encoder-based approaches in both prompt fidelity and subject preservation, with efficient training and broad editing capabilities including pose control.

Abstract

Diffusion-based models have demonstrated impressive capabilities for text-to-image generation and are expected to support personalized applications of subject-driven generation, which require the generation of customized concepts from one or a few reference images. However, existing methods based on fine-tuning fail to balance the trade-off between subject learning and the maintenance of the generation capabilities of pretrained models. Moreover, other methods that utilize additional image encoders tend to lose important details of the subject due to encoding compression. To address these challenges, we propose DreamTuner, a novel method that injects reference information from coarse to fine to achieve subject-driven image generation more effectively. DreamTuner introduces a subject-encoder for coarse subject identity preservation, where the compressed general subject features are introduced through an attention layer before visual-text cross-attention. We then modify the self-attention layers within pretrained text-to-image models to self-subject-attention layers to refine the details of the target subject. The generated image queries detailed features from both the reference image and itself in self-subject-attention. It is worth emphasizing that self-subject-attention is an effective, elegant, and training-free method for maintaining the detailed features of customized subjects and can serve as a plug-and-play solution during inference. Finally, with additional subject-driven fine-tuning, DreamTuner achieves remarkable performance in subject-driven image generation, which can be controlled by text or other conditions such as pose. For further details, please visit the project page at https://dreamtuner-diffusion.github.io/.
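
The abstract's description of self-subject-attention (the generated image queries features from both the reference image and itself) can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: the function name `self_subject_attention`, the single-head layout, the shared projection matrices, and the plain concatenation of keys and values are all hypothetical choices made here, and the paper's actual layer may include weighting or masking details that this summary does not state.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_subject_attention(gen_feats, ref_feats, w_q, w_k, w_v):
    """Single-head attention where generated tokens attend to themselves
    and to reference-image tokens (hypothetical sketch).

    gen_feats: (n_gen, d) features of the image being generated
    ref_feats: (n_ref, d) features of the reference image
    w_q, w_k, w_v: (d, d) projection matrices, assumed shared by both sources
    """
    # Queries come only from the generated image.
    q = gen_feats @ w_q
    # Keys and values come from the generated image and the reference image.
    kv_source = np.concatenate([gen_feats, ref_feats], axis=0)
    k = kv_source @ w_k
    v = kv_source @ w_v
    # Scaled dot-product attention over the combined token set.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    gen = rng.normal(size=(5, d))   # 5 generated-image tokens
    ref = rng.normal(size=(3, d))   # 3 reference-image tokens
    w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
    out, weights = self_subject_attention(gen, ref, w_q, w_k, w_v)
    print(out.shape, weights.shape)
```

Because no new parameters are introduced in this sketch, passing an empty reference (`ref_feats` with zero rows) reduces it to ordinary self-attention, which is consistent with the abstract's claim that the mechanism can be used training-free at inference.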

Paper Structure (19 sections, 6 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: Subject-driven image generation results of our method. Our proposed DreamTuner can generate high-fidelity images of a user-provided subject, guided by complex texts (the first two rows) or other conditions such as pose (the last row), while maintaining the identity and appearance of the specific subject. We found that a single image is enough for surprisingly good subject-driven image generation.
  • Figure 2: Exploration experiment of self-attention. Using the reference image features for self-attention makes the generated images more similar to the reference image. Detailed text can better serve its purpose.
  • Figure 3: Overview of the proposed DreamTuner framework. First, a subject-encoder (SE) is trained for coarse identity preservation, where a frozen ControlNet is utilized to maintain the layout. Then an additional fine-tuning stage, as in existing methods, is conducted with the proposed subject-encoder and self-subject-attention for fine identity preservation. Finally, a refined subject-driven image generation model is obtained that can synthesize high-fidelity images of the specific subject controlled by text or other layout conditions. It is worth noting that both the subject-driven fine-tuning stage and the inference stage require only a single reference image.
  • Figure 4: Illustration of the text-to-image generation U-Net model with the proposed subject-encoder.
  • Figure 5: Illustration of the proposed self-subject-attention. S-A indicates self-attention.
  • ...and 5 more figures