SerialGen: Personalized Image Generation by First Standardization Then Personalization

Cong Xie, Han Zou, Ruiqi Yu, Yan Zhang, Zhenpeng Zhan

TL;DR

SerialGen targets a core tension in personalized image generation: achieving high text controllability while preserving whole-body appearance. It introduces a two-stage, tuning-free pipeline. The standardization stage realigns the non-appearance factors of the reference image using a standardization model equipped with a Foreground-Background Distinction Module (FBDM) and a Reference Pose Injection Module (RPIM). The personalization stage then trains a diffusion-based model on (standardized reference, target) pairs. The standardization model is trained on synthetic data and kept frozen during personalization. The method reports strong CLIP-I, CLIP-T, and Face Sim scores, along with favorable user study results. It yields consistent serial outputs across prompts and is reported to outperform the compared methods on both synthetic and real-world tasks, with practical implications for comic and story generation and other applications that require a consistent character appearance. As future work, the authors discuss mitigating the domain gap between synthetic and real data and further improving identity preservation across the two stages.

Abstract

In this work, we are interested in achieving both high text controllability and whole-body appearance consistency in the generation of personalized human characters. We propose a novel framework, named SerialGen, which is a serial generation method consisting of two stages: first, a standardization stage that standardizes reference images, and then a personalized generation stage based on the standardized reference. Furthermore, we introduce two modules aimed at enhancing the standardization process. Our experimental results validate the proposed framework's ability to produce personalized images that faithfully recover the reference image's whole-body appearance while accurately responding to a wide range of text prompts. Through thorough analysis, we highlight the critical contribution of the proposed serial generation method and standardization model, evidencing enhancements in appearance consistency between reference and output images and across serial outputs generated from diverse text prompts. The term "Serial" in this work carries a double meaning: it refers to the two-stage method and also underlines our ability to generate serial images with consistent appearance throughout.

Paper Structure

This paper contains 30 sections, 5 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Serial images generated by SerialGen. Our method can produce personalized images that faithfully recover the reference image's whole-body appearance while accurately responding to a wide range of text prompts. More results are in the supplementary material.
  • Figure 2: Overview of the proposed SerialGen with two stages: (1) Standardization – training a standardization model on synthetic data, and (2) Personalization – using the standardization model to create (standardized reference, target) pairs for personalized text-to-image model training. During inference, once a reference image is standardized, serial images can be generated based on different text prompts.
  • Figure 3: Illustration of the standardization model. The pose and mask of the reference image are input into ReferenceNet to enhance the effect.
  • Figure 4: Comparison with other methods. Our method is capable of generating images with high text controllability and appearance consistency.
  • Figure 5: Visual comparison of different training strategies. Prompt: 1girl, cooking the meal, in the park. (a) Reference Image; (b) Unpaired one-stage; (c) Paired one-stage; (d) Ours.
  • ...and 9 more figures
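
The inference flow described in the Figure 2 caption (standardize the reference once, then reuse it for every prompt) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the `Image` class, `serial_generate`, and the two toy stub functions are hypothetical names introduced here, and the stubs manipulate strings rather than running any diffusion model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Image:
    """Hypothetical stand-in for image data (a real pipeline would hold pixels)."""
    description: str


def serial_generate(
    reference: Image,
    prompts: List[str],
    standardize: Callable[[Image], Image],
    personalize: Callable[[Image, str], Image],
) -> List[Image]:
    """Two-stage inference: standardize once, then generate one image per prompt."""
    # Stage 1: standardize the reference a single time.
    standardized = standardize(reference)
    # Stage 2: reuse the same standardized reference for every text prompt,
    # so all outputs are conditioned on one shared appearance source.
    return [personalize(standardized, prompt) for prompt in prompts]


# Toy stubs (assumptions for illustration only, not real models).
calls = {"standardize": 0}


def toy_standardize(img: Image) -> Image:
    calls["standardize"] += 1
    return Image(f"standardized({img.description})")


def toy_personalize(std: Image, prompt: str) -> Image:
    return Image(f"{std.description} | {prompt}")


outputs = serial_generate(
    Image("reference"),
    ["cooking the meal, in the park", "reading a book"],
    toy_standardize,
    toy_personalize,
)
for out in outputs:
    print(out.description)
```

The point of the structure is that the standardization step runs once per reference, while the personalization step runs once per prompt; sharing one standardized reference across prompts is what the caption means by generating serial images.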