CapHuman: Capture Your Moments in Parallel Universes

Chao Liang, Fan Ma, Linchao Zhu, Yingying Deng, Yi Yang

TL;DR

CapHuman targets one-shot personalized portrait synthesis with broad head-pose and lighting variation while preserving identity. It achieves this via an encode-then-align identity preservation scheme that injects global and local identity features into a pre-trained diffusion backbone, together with a 3D facial prior (via FLAME/DECA) that enables 3D-consistent, fine-grained head control. The framework introduces a CapFace module, time-dependent identity dropout, and an optional post-hoc head-control fusion to balance identity, pose, and realism without subject-specific fine-tuning. On a new HumanIPHC benchmark with 100 identities and 35 prompts across 10 head conditions, it reports state-of-the-art performance in identity preservation, text-to-image alignment, and head-control precision, with strong qualitative results across multiple styles and applications. Overall, CapHuman advances practical, controllable, identity-preserving human image synthesis with broad potential for personalized avatars, stylized portraits, and ID-like outputs, while acknowledging ethical considerations and biases.
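
The summary above names time-dependent identity dropout without describing how it works. One plausible reading, which is purely an assumption here and not confirmed by this page, is that identity features are zeroed out during the noisiest part of the denoising process and restored afterwards. The sketch below illustrates that reading in plain Python; the function names, the `drop_fraction` parameter, and the hard on/off schedule are all hypothetical.

```python
def identity_keep_mask(num_steps: int, drop_fraction: float) -> list[bool]:
    """For each denoising step (index 0 = noisiest), say whether identity is kept.

    Assumption: identity features are dropped for the first `drop_fraction`
    of the steps and kept for the rest. The real schedule may differ.
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must lie in [0, 1]")
    num_dropped = int(round(num_steps * drop_fraction))
    return [step >= num_dropped for step in range(num_steps)]


def apply_identity_dropout(identity_features: list[float], keep: bool) -> list[float]:
    """Return the features unchanged when kept, or an all-zero vector when dropped."""
    if keep:
        return list(identity_features)
    return [0.0] * len(identity_features)


# Hypothetical usage: 10 denoising steps, identity dropped for the first 30%.
mask = identity_keep_mask(num_steps=10, drop_fraction=0.3)
features_per_step = [apply_identity_dropout([0.5, -1.0], keep) for keep in mask]
```

Under this reading, the early steps would be free to settle the coarse head layout, and the identity signal would only shape the later steps. Whether CapHuman uses a hard cutoff, a soft weighting, or a different direction entirely is not stated here.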

Abstract

We concentrate on a novel human-centric image synthesis task: given only one reference facial photograph, generate images of that specific individual with diverse head positions, poses, facial expressions, and illuminations in different contexts. To accomplish this goal, we argue that our generative model should have the following characteristics: (1) a strong visual and semantic understanding of our world and human society for basic object and human image generation; (2) generalizable identity preservation ability; (3) flexible and fine-grained head control. Recently, large pre-trained text-to-image diffusion models have shown remarkable results, serving as a powerful generative foundation. Building on this foundation, we aim to unlock the latter two capabilities in the pre-trained model. In this work, we present a new framework named CapHuman. We embrace the "encode then learn to align" paradigm, which enables generalizable identity preservation for new individuals without cumbersome tuning at inference. CapHuman encodes identity features and then learns to align them into the latent space. Moreover, we introduce a 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner. Extensive qualitative and quantitative analyses demonstrate that CapHuman can produce identity-preserving, photo-realistic, and high-fidelity portraits with content-rich representations and various head renditions, superior to established baselines. Code and checkpoint will be released at https://github.com/VamosC/CapHuman.

Paper Structure (62 sections, 9 equations, 22 figures, 9 tables)


Figures (22)

  • Figure 1: Given only one reference facial photograph, our CapHuman can generate photo-realistic specific individual portraits with content-rich representations and diverse head positions, poses, facial expressions, and illuminations in different contexts.
  • Figure 2: Overview of CapHuman. Our CapHuman stands upon the pre-trained T2I diffusion model. a) We embrace the "encode then learn to align" paradigm for generalizable identity preservation. b) The introduction of the 3D parametric face model enables flexible and fine-grained head control. c) We learn a CapFace module $\pi$ to equip the pre-trained T2I diffusion model with the above capabilities.
  • Figure 3: Qualitative results. Our CapHuman can produce identity-preserved, photo-realistic portraits with various head positions and poses in different contexts. Our model can also be flexibly combined with pre-trained models from the community, e.g. RealisticVision.
  • Figure 4: Head position, pose, facial expression, and illumination control. Our method offers 3D-consistent head control.
  • Figure 5: Adapt our model to other pre-trained models. Our model can be adapted to generate portraits in different styles.
  • ...and 17 more figures