RealisID: Scale-Robust and Fine-Controllable Identity Customization via Local and Global Complementation

Zhaoyang Sun, Fei Du, Weihua Chen, Fan Wang, Yaxiong Chen, Yi Rong, Shengwu Xiong

TL;DR

RealisID tackles identity customization in text-to-image synthesis by introducing two complementary branches that separately control facial details and global image layout, enabling scale-robust identity fidelity even for small faces and multi-person scenarios. It extracts condition signals from reference faces (ID embeddings, pose-expression guidance, and location guidance) and injects them through local and global ControlNet variants into a frozen diffusion backbone. The approach achieves strong qualitative and quantitative performance against state-of-the-art baselines, particularly in small-face identity preservation and flexible control of pose, expression, and layout. The framework supports scalable, fine-grained identity customization without per-ID fine-tuning and extends naturally to multi-person applications.

Abstract

Recently, the success of text-to-image synthesis has greatly advanced the development of identity customization techniques, whose main goal is to produce realistic identity-specific photographs based on text prompts and reference face images. However, it is difficult for existing identity customization methods to simultaneously meet the various requirements of different real-world applications, including identity fidelity for small faces, control of face location, pose, and expression, and customization of multiple persons. To this end, we propose a scale-robust and fine-controllable method, namely RealisID, which learns different control capabilities through the cooperation between a pair of local and global branches. Specifically, by using cropping and up-sampling operations to filter out face-irrelevant information, the local branch concentrates on the fine control of facial details and on scale-robust identity fidelity within the face region. Meanwhile, the global branch manages the overall harmony of the entire image. It also controls the face location by taking the location guidance as input. As a result, RealisID can benefit from the complementarity of these two branches. Finally, by implementing our branches with two different variants of ControlNet, our method can be easily extended to handle multi-person customization, even when trained only on single-person datasets. Extensive experiments and ablation studies indicate the effectiveness of RealisID and verify its ability to fulfill all the requirements mentioned above.
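
The cropping and up-sampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `crop_and_upsample`, the `(top, left, height, width)` box format, the nearest-neighbor resampling, and the fixed output size are all assumptions made for illustration. The point is only that a face of any size is mapped to the same resolution before the local branch sees it.

```python
def crop_and_upsample(image, box, out_size=8):
    """Crop a face box and resize it to out_size x out_size.

    image: 2D list of pixel values (rows of columns).
    box:   hypothetical (top, left, height, width) face bounding box.
    Nearest-neighbor resampling is an assumption chosen for simplicity.
    """
    top, left, height, width = box
    # Cropping discards everything outside the face region.
    face = [row[left:left + width] for row in image[top:top + height]]
    # Map each output pixel back to its nearest source pixel in the crop.
    return [
        [face[r * height // out_size][c * width // out_size]
         for c in range(out_size)]
        for r in range(out_size)
    ]


# A 16x16 image with a small 2x2 "face" of ones at rows 4-5, columns 6-7.
image = [[0] * 16 for _ in range(16)]
for r in range(4, 6):
    for c in range(6, 8):
        image[r][c] = 1

small = crop_and_upsample(image, (4, 6, 2, 2))
large = crop_and_upsample(image, (0, 0, 16, 16))
```

Both `small` and `large` come out as 8x8 grids, so a downstream module would receive inputs at one fixed face scale regardless of how small the face was in the original image.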

Paper Structure

This paper contains 34 sections, 8 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Our RealisID can flexibly and finely control the face location, pose and expression factors of the generated facial images. It is also able to keep high identity fidelity for small faces and easily generalizes to multi-person customization.
  • Figure 2: (a) The overall architecture of our RealisID framework, which constructs a pair of local and global branches to inject additional condition information into the U-Net denoiser of a pre-trained stable diffusion model. (b) The procedure of extracting different condition signals from the input reference images. (c) The inference strategy for handling multi-person customization.
  • Figure 3: Qualitative comparison between different methods. The odd and even rows correspond to regular and small face scenarios, respectively. Regardless of face scale, our RealisID framework achieves high fidelity of identity and facial details, thus generating visually appealing portrait images.
  • Figure 4: Effects of our local and global branches.
  • Figure 5: Ablation study on scale robustness. Our method achieves high identity fidelity across different face sizes.
  • ...and 11 more figures