Training for Identity, Inference for Controllability: A Unified Approach to Tuning-Free Face Personalization

Lianyu Pang, Ji Zhou, Qiping Wang, Baoquan Zhao, Zhenguo Yang, Qing Li, Xudong Mao

TL;DR

UniID tackles the trade-off between identity fidelity and text controllability in tuning-free face personalization by unifying text-embedding and adapter-based approaches. It introduces a dual-branch architecture in which each branch learns identity-relevant features through identity-focused training, and it applies layer-wise normalized rescaling at inference to preserve the diffusion model's controllability. The method demonstrates superior identity preservation and prompt alignment compared to six baselines on both synthetic and real portraits, supported by qualitative, quantitative, and user-study evidence. This work offers a practical, principled path toward high-fidelity, controllable, tuning-free personalization in diffusion-based generation systems.

Abstract

Tuning-free face personalization methods have developed along two distinct paradigms: text embedding approaches that map facial features into the text embedding space, and adapter-based methods that inject features through auxiliary cross-attention layers. While both paradigms have shown promise, existing methods struggle to simultaneously achieve high identity fidelity and flexible text controllability. We introduce UniID, a unified tuning-free framework that synergistically integrates both paradigms. Our key insight is that when merging these approaches, they should mutually reinforce only identity-relevant information while preserving the original diffusion prior for non-identity attributes. We realize this through a principled training-inference strategy: during training, we employ an identity-focused learning scheme that guides both branches to capture identity features exclusively; at inference, we introduce a normalized rescaling mechanism that recovers the text controllability of the base diffusion model while enabling complementary identity signals to enhance each other. This principled design enables UniID to achieve high-fidelity face personalization with flexible text controllability. Extensive experiments against six state-of-the-art methods demonstrate that UniID achieves superior performance in both identity preservation and text controllability. Code will be available at https://github.com/lyuPang/UniID

Paper Structure

This paper contains 24 sections, 4 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: UniID enables high-quality face personalization with flexible text controllability and consistent identity preservation.
  • Figure 2: Pairing IP-Adapter with ground-truth identity names significantly enhances identity preservation. However, augmenting IP-Adapter with learned text embeddings substantially degrades text controllability.
  • Figure 3: Overview of UniID. (Top) We map the facial features extracted by the image encoder into the output embeddings of the text encoder. The predicted embeddings are concatenated with those of the given prompt. (Bottom) The extracted facial features are also injected into the pre-trained diffusion model via auxiliary cross-attention layers. At inference time, we apply the proposed normalized rescaling strategy to both branches to recover the text controllability of the original diffusion model.
  • Figure 4: Layer-wise output magnitude ratios.
  • Figure 5: Grid search results for hyperparameters $\alpha$ and $\beta$. Zoom in for a better view.
  • ...and 8 more figures
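
As a rough illustration of the inference-time idea described in the Figure 3 caption, the sketch below adds an identity-branch output to a base cross-attention output and then rescales the sum back to the base output's magnitude. This is only one plausible reading of "normalized rescaling": the paper's actual formulation, including the role of $\alpha$ and $\beta$, is not given in this excerpt. The function names and the `weight` parameter are hypothetical and do not come from the UniID code.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def rescale_to_reference(base_out, identity_out, weight=1.0, eps=1e-8):
    """Combine two layer outputs, then match the base output's norm.

    ASSUMPTION: this norm-matching rule is illustrative only; it is not
    taken from the paper. `weight` is a hypothetical strength parameter.
    """
    # Inject the identity signal into the base layer output.
    combined = [b + weight * i for b, i in zip(base_out, identity_out)]
    # Scale factor that restores the magnitude the base model produced.
    scale = l2_norm(base_out) / (l2_norm(combined) + eps)
    return [scale * c for c in combined]


# Toy vectors standing in for one layer's output for a single token.
base = [3.0, 4.0]       # norm 5
identity = [4.0, -3.0]  # norm 5, orthogonal to base
out = rescale_to_reference(base, identity)
```

In this toy case the direction of `out` shifts toward the identity signal while its norm stays at that of `base`, which is the intuition behind keeping the layer's output magnitude close to what the original diffusion model would produce.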