FPGA: Flexible Portrait Generation Approach

Zhaoli Deng, Fanyi Wang, Junkang Zhang, Fan Chen, Meng Zhang, Wendong Zhang, Wen Liu, Zhenpeng Mi

TL;DR

FPGA targets multi‑ID, full‑body portrait generation, where faces occupy few pixels and often come out degraded. It combines a Multi‑Mode Fusion training strategy (MMF) with a DDIM Inversion based ID Restoration framework (DIIR), introduces IDZoom, a million‑scale multi‑modal dataset, and uses RepControlNet‑based acceleration to provide fast, region‑specific identity control and post‑hoc face restoration on diffusion models. In comparative and ablation experiments, FPGA outperforms prior methods on both objective and subjective metrics and demonstrates multi‑ID placement, face restoration, and face swapping on stylized faces, with inference within 2.5 s on a single L20 GPU. DIIR is plug‑and‑play and can be applied to existing diffusion‑based portrait methods.

Abstract

Portrait Fidelity Generation is a prominent research area in generative models. Current methods struggle to generate full-body images with low-resolution faces, especially in multi-ID photos. To tackle these issues, we propose a comprehensive system called FPGA and construct IDZoom, a million-level multi-modal dataset, for training. FPGA consists of a Multi-Mode Fusion training strategy (MMF) and a DDIM Inversion based ID Restoration inference framework (DIIR). MMF aims to activate the specified ID in the specified facial region. DIIR aims to remove face artifacts while preserving the background. Furthermore, DIIR is plug-and-play and can be applied to any diffusion-based portrait generation method to enhance its performance. DIIR can also perform face swapping and is applicable to stylized faces. To validate the effectiveness of FPGA, we conducted extensive comparative and ablation experiments. The results demonstrate that FPGA has significant advantages in both subjective and objective metrics and achieves controllable generation in multi-ID scenarios. In addition, we reduce inference time to within 2.5 seconds on a single L20 graphics card, mainly through our reparameterization method, RepControlNet.
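DIIR builds on DDIM inversion, so a small illustration of that building block may help. The sketch below is not the paper's pipeline: it shows only the deterministic DDIM update, run forward to invert a latent into noise and backward to recover it. The names `ddim_step`, `toy_eps`, `invert`, `sample`, and the `ALPHAS` schedule are hypothetical. The noise predictor is a constant stand-in, which makes the round trip exact; with a real network that depends on the latent and the timestep, inversion is only approximate.

```python
import math

# Hypothetical cumulative alpha schedule: index 0 is the clean latent (alpha = 1),
# and larger indices carry more noise.
ALPHAS = [1.0, 0.9, 0.7, 0.4, 0.1]

def toy_eps(x, t):
    """Constant stand-in for a noise-prediction network (assumption, not a real model)."""
    return [0.3, -0.2]

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update that moves x from noise level a_from to a_to."""
    # Clean sample implied by x and the noise estimate.
    x0 = [(xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
          for xi, ei in zip(x, eps)]
    # Re-noise that prediction to the target level using the same noise estimate.
    return [math.sqrt(a_to) * x0i + math.sqrt(1.0 - a_to) * ei
            for x0i, ei in zip(x0, eps)]

def invert(x_clean):
    """DDIM inversion: walk the schedule toward higher noise."""
    x = list(x_clean)
    for t in range(len(ALPHAS) - 1):
        x = ddim_step(x, toy_eps(x, t), ALPHAS[t], ALPHAS[t + 1])
    return x

def sample(x_noisy):
    """DDIM sampling: walk the schedule back toward the clean latent."""
    x = list(x_noisy)
    for t in range(len(ALPHAS) - 1, 0, -1):
        x = ddim_step(x, toy_eps(x, t), ALPHAS[t], ALPHAS[t - 1])
    return x

latent = [0.5, -1.2]
noisy = invert(latent)
recovered = sample(noisy)
```

Because the inverted latent reproduces the input when sampled again, content that should stay fixed (the background, in DIIR's case) can in principle be preserved while other conditioning is changed during re-sampling.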

Paper Structure (28 sections, 6 equations, 17 figures, 6 tables, 1 algorithm)

Figures (17)

  • Figure 1: Generation results of FPGA with different resolution faces, and face restoration capabilities of our DIIR.
  • Figure 2: Flowchart of the Multi-Mode Fusion training strategy (MMF). The clone operation in Sec. 3.1.3 enables the Mask Guided Multi-ID Cross Attention to place each specified face at its specified location during inference.
  • Figure 3: Flowchart of the DDIM Inversion based ID Restoration inference framework (DIIR). DIIR repairs collapsed faces.
  • Figure 4: Flowchart of the training and reparameterization process of RepControlNet.
  • Figure 5: Visual comparison with SOTA methods on full-body generation. FPGA* performs strongly on low-resolution face generation; * denotes using DIIR. Clearer results are presented in the supplementary materials.
  • ...and 12 more figures
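The reparameterization mentioned for Figure 4 rests on a general property of linear layers: two parallel branches that share an input and are summed can be folded into a single layer with summed weights and biases. The sketch below illustrates only that property. The branch layout and the names `linear` and `merge_branches` are assumptions, since this section does not give the RepControlNet layer design.

```python
def linear(W, b, x):
    """Apply y = W x + b using plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def merge_branches(W_base, b_base, W_ctrl, b_ctrl):
    """Fold two parallel linear branches into one set of weights."""
    W = [[wb + wc for wb, wc in zip(row_b, row_c)]
         for row_b, row_c in zip(W_base, W_ctrl)]
    b = [bb + bc for bb, bc in zip(b_base, b_ctrl)]
    return W, b

# Hypothetical weights: a frozen base branch and a trained control branch.
W_base, b_base = [[1.0, 2.0], [0.5, -1.0]], [0.1, 0.2]
W_ctrl, b_ctrl = [[0.3, -0.4], [0.0, 0.7]], [-0.1, 0.05]
x = [2.0, -3.0]

# Training-time view: both branches run and their outputs are added.
two_branch = [p + q for p, q in zip(linear(W_base, b_base, x),
                                    linear(W_ctrl, b_ctrl, x))]

# Inference-time view: one merged layer, so the extra branch costs nothing.
W_merged, b_merged = merge_branches(W_base, b_base, W_ctrl, b_ctrl)
merged = linear(W_merged, b_merged, x)
```

The same identity holds for convolutions with matching kernel size, stride, and padding, which is why a merged model can run at the cost of the base network alone.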