GSwap: Realistic Head Swapping with Dynamic Neural Gaussian Field

Jingtao Zhou, Xuan Gao, Dongyu Liu, Junhui Hou, Yudong Guo, Juyong Zhang

Abstract

We present GSwap, a novel system for consistent and realistic video head swapping, empowered by dynamic neural Gaussian portrait priors, that significantly advances the state of the art in face and head replacement. Unlike previous methods that rely primarily on 2D generative models or 3D Morphable Face Models (3DMM), our approach overcomes their inherent limitations, including poor 3D consistency, unnatural facial expressions, and restricted synthesis quality. Moreover, existing techniques struggle with full head-swapping tasks due to insufficient holistic head modeling and ineffective background blending, often resulting in visible artifacts and misalignments. To address these challenges, GSwap introduces an intrinsic 3D Gaussian feature field embedded within a full-body SMPL-X surface, effectively elevating 2D portrait videos into a dynamic neural Gaussian field. This innovation ensures high-fidelity, 3D-consistent portrait rendering while preserving natural head-torso relationships and seamless motion dynamics. To facilitate training, we adapt a pretrained 2D portrait generative model to the source head domain using only a few reference images, enabling efficient domain adaptation. Furthermore, we propose a neural re-rendering strategy that harmoniously integrates the synthesized foreground with the original background, eliminating blending artifacts and enhancing realism. Extensive experiments demonstrate that GSwap surpasses existing methods in multiple aspects, including visual quality, temporal coherence, identity preservation, and 3D consistency.

Paper Structure

This paper contains 28 sections, 13 equations, 16 figures, and 3 tables.

Figures (16)

  • Figure 1: Given a target video and few-shot source images, our GSwap first adapts a pre-trained 2D portrait generative model to the source head domain. It then elevates the 2D video head swapping task into the dynamic neural Gaussian field, significantly enhancing the temporal consistency and overall quality of the head swap.
  • Figure 2: The overall pipeline of our GSwap. First, we adapt a pretrained 2D portrait generation model to the source head domain with source images $\mathcal{I}_{\textrm{src}}$ as input, generating a batch of training data through inpainting (detailed in Fig. \ref{fig:sec3.1}). Second, we introduce the neural Gaussian field as our portrait representation and use a neural rerenderer to handle the mismatched region between foreground and background, generating photo-realistic output.
  • Figure 3: We begin by performing domain adaptation on a pre-trained 2D head generation model. We then use this model to inpaint the head region in the target video, creating a dataset suitable for head swapping. Some images will be mosaicked upon publication.
  • Figure 4: We first employ a coarse face reenactment method to obtain a head with the corresponding pose and expression. Then, we warp the head mask of the face reenactment result using five keypoints matched between the reenacted head and the target head.
  • Figure 5: To mitigate expression and pose misalignments, we re-track the SMPL-X parameters ($\beta_{\textrm{gen}}$, $\theta_{\textrm{gen}}$, $\psi_{\textrm{gen}}$) of $\mathcal{I}_{\textrm{gen}}$ during the training phase. The parameters tracked from the original target video $\mathcal{I}_{\textrm{tgt}}$ are then used for inference.
  • ...and 11 more figures
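
The Figure 4 caption mentions warping the reenacted head mask using five keypoints shared with the target head. The captions do not say which transform family is used, so the following is only a minimal sketch under an assumption: a least-squares 2D similarity transform (scale, rotation, translation) fitted to the keypoint pairs. The function names `fit_similarity` and `warp_points` and the sample coordinates are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def fit_similarity(src: List[Point], dst: List[Point]) -> Tuple[complex, complex]:
    """Fit dst ~= a * src + b in the least-squares sense.

    Points are treated as complex numbers, so the complex factor `a`
    encodes scale and rotation and `b` encodes translation.
    Assumption: a similarity transform is adequate; the paper's actual
    warp may differ.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched keypoint pairs")
    s = [complex(x, y) for x, y in src]
    d = [complex(x, y) for x, y in dst]
    mean_s = sum(s) / len(s)
    mean_d = sum(d) / len(d)
    # Closed-form least-squares solution on centered points.
    denom = sum(abs(p - mean_s) ** 2 for p in s)
    if denom == 0:
        raise ValueError("source keypoints must not all coincide")
    a = sum((q - mean_d) * (p - mean_s).conjugate() for p, q in zip(s, d)) / denom
    b = mean_d - a * mean_s
    return a, b


def warp_points(points: List[Point], a: complex, b: complex) -> List[Point]:
    """Apply the fitted transform to arbitrary points, e.g. a mask contour."""
    warped = []
    for x, y in points:
        z = a * complex(x, y) + b
        warped.append((z.real, z.imag))
    return warped


# Hypothetical five keypoints on the reenacted head (e.g. eyes, nose, mouth corners).
reenacted_kps = [(30.0, 40.0), (70.0, 40.0), (50.0, 60.0), (35.0, 80.0), (65.0, 80.0)]
# Hypothetical matching keypoints on the target head: same layout, 1.5x larger, shifted.
target_kps = [(1.5 * x + 20.0, 1.5 * y + 10.0) for x, y in reenacted_kps]

a, b = fit_similarity(reenacted_kps, target_kps)
# Map a few contour points of the reenacted head mask into the target frame.
mask_contour = [(20.0, 20.0), (80.0, 20.0), (80.0, 100.0), (20.0, 100.0)]
warped_contour = warp_points(mask_contour, a, b)
```

In practice a dense mask would usually be resampled by inverse-mapping each output pixel through the fitted transform rather than by forward-mapping contour points; the sketch only illustrates the keypoint-based alignment step.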