EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance

Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, Lingqiao Liu

TL;DR

EZIGen addresses zero-shot personalized image generation by combining a fixed pre-trained Stable Diffusion UNet as a subject encoder with a decoupled, two-stage guidance strategy. It decouples Sketch Generation (text-driven) from Appearance Transfer (subject-driven) and adds an iterative bootstrap to progressively integrate subject details, while training only lightweight adapters. The approach achieves state-of-the-art results on SD2.1-base and SDXL with dramatically reduced training data, and extends naturally to personalized image editing and domain-specific tasks. This yields a versatile, model-agnostic solution that preserves fine-grained subject details and maintains robust prompt adherence, enabling practical, efficient personalization without per-subject retraining.

Abstract

Zero-shot personalized image generation models aim to produce images that align with both a given text prompt and a subject image, requiring the model to incorporate both sources of guidance. Existing methods often struggle to capture fine-grained subject details and frequently prioritize one form of guidance over the other, resulting in suboptimal subject encoding and imbalanced generation. In this study, we uncover key insights into overcoming these drawbacks, notably that 1) the choice of subject image encoder critically influences subject identity preservation and training efficiency, and 2) the text and subject guidance should take effect at different denoising stages. Building on these insights, we introduce a new approach, EZIGen, with two main components: using a fixed pre-trained diffusion UNet itself as the subject encoder, and a generation process that balances the two forms of guidance by separating the stages in which each dominates and revisiting certain time steps to bootstrap subject transfer quality. With these two components, EZIGen, initially built upon SD2.1-base, achieves state-of-the-art performance on multiple personalized generation benchmarks with a unified model, while using 100 times less training data. Moreover, migrating the design to SDXL shows that EZIGen is a versatile, model-agnostic solution for personalized generation. Demo Page: zichengduan.github.io/pages/EZIGen/index.html
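
The abstract's second insight, that text and subject guidance should act at different denoising stages and that some time steps are revisited, can be pictured as a step plan. The following is a toy sketch under stated assumptions, not the authors' implementation: the function name `decoupled_schedule` and the parameters `split_step` and `num_iters` are hypothetical, the hard switch between guidance types is a simplification, and no diffusion model is invoked.

```python
from typing import List, Tuple


def decoupled_schedule(
    num_steps: int, split_step: int, num_iters: int
) -> List[Tuple[int, str]]:
    """Build a toy denoising plan as (timestep, dominant_guidance) pairs.

    Assumptions (hypothetical, for illustration only):
      - timesteps run from num_steps - 1 (most noisy) down to 0;
      - timesteps >= split_step are text-dominated ("sketch" stage);
      - timesteps < split_step are subject-dominated ("appearance" stage);
      - the appearance stage is revisited num_iters times in total.
    """
    if not 0 < split_step < num_steps:
        raise ValueError("split_step must satisfy 0 < split_step < num_steps")
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")

    plan: List[Tuple[int, str]] = []

    # Stage 1: early, noisy steps follow the text prompt to fix the layout.
    for t in range(num_steps - 1, split_step - 1, -1):
        plan.append((t, "text"))

    # Stage 2: later steps follow the subject image; the same range of
    # timesteps is revisited on every additional iteration.
    for _ in range(num_iters):
        for t in range(split_step - 1, -1, -1):
            plan.append((t, "subject"))

    return plan


if __name__ == "__main__":
    for timestep, guidance in decoupled_schedule(10, 4, 2):
        print(timestep, guidance)
```

In a real pipeline, each revisit would presumably require re-noising the current latent back to the split timestep before denoising again; that step is omitted here because the abstract does not specify it.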

Paper Structure

This paper contains 26 sections, 5 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Our model demonstrates remarkable zero-shot performance in producing high-quality and flexible images from a single reference object and can serve as a versatile model for both personalized image generation and editing with a unified design.
  • Figure 2: Suboptimal subject encoding. BootPIG's encoder design may lead to degraded performance compared to ours.
  • Figure 3: Conflicting guidance. Existing methods struggle to balance between identity preservation and text prompt alignment.
  • Figure 4: Illustration of the proposed system. We begin by Encoding and Injecting subject features. Next, we decouple one generation process into the Sketch Generation Process and the Appearance Transfer Process. Finally, we introduce the Iterative Appearance Transfer mechanism to fully transfer the subject appearance feature to the sketch latent.
  • Figure 5: Comparison with existing personalized image generation methods. Our design perfectly preserves the subject's fine-grained details (e.g. fur textures, body shape, subject structures) while precisely inheriting the flexibilities (e.g. pose variation) from text-prompt.
  • ...and 16 more figures