TextToon: Real-Time Text Toonify Head Avatar from Single Video

Luchuan Song, Lele Chen, Celong Liu, Pinxin Liu, Chenliang Xu

TL;DR

TextToon is a method for generating a toonified head avatar that can be driven in real time by another video of an arbitrary identity. It expands the stylization capabilities of 3D Gaussian Splatting by introducing an adaptive pixel-translation neural network and leveraging patch-aware contrastive learning to achieve high-quality images.

Abstract

We propose TextToon, a method to generate a drivable toonified avatar. Given a short monocular video sequence and a written instruction about the avatar style, our model can generate a high-fidelity toonified avatar that can be driven in real-time by another video with arbitrary identities. Existing related works heavily rely on multi-view modeling to recover geometry via texture embeddings, presented in a static manner, leading to control limitations. The multi-view video input also makes it difficult to deploy these models in real-world applications. To address these issues, we adopt a conditional embedding Tri-plane to learn realistic and stylized facial representations in a Gaussian deformation field. Additionally, we expand the stylization capabilities of 3D Gaussian Splatting by introducing an adaptive pixel-translation neural network and leveraging patch-aware contrastive learning to achieve high-quality images. To push our work into consumer applications, we develop a real-time system that can operate at 48 FPS on a GPU machine and 15-18 FPS on a mobile machine. Extensive experiments demonstrate the efficacy of our approach in generating textual avatars over existing methods in terms of quality and real-time animation. Please refer to our project page for more details: https://songluchuan.github.io/TextToon/.

Paper Structure

This paper contains 35 sections, 14 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Overview of our method. It takes a monocular video as input and tracks each frame, initializing the Gaussian point cloud from the tracked geometry of the first frame. We leverage the rigid transformation matrix ($\mathbf{R}, \mathbf{T}_{x,y,z}$) and a learnable lazy factor $w$ (see Sec. "Photo-Realistic Appearance Pre-training") to transfer points from the canonical space to the observation space. The proposed conditional Tri-plane Gaussian Deformation Field $\mathbf{D}_c$ uses the normalized render map $m_t$, expression $\beta_t$, and vertex positions $\mathbf{S}_t$ to predict the deformation of the Gaussian properties for each Gaussian point. Both the pre-training and fine-tuning phases share the same structure but target the realistic appearance and the T2I-synthesized appearance, respectively. The details of the conditional Tri-plane Gaussian Deformation Field and Text2Image editing are shown in III) and IV), respectively. Natural face$\copyright$Lizhen Wang et al. (CC BY).
  • Figure 2: A visualization of the points adaptively selected via $w$. With the lazy factor $w$, a soft boundary forms between the head and shoulders. Without it (w/o), the two regions are mixed together and difficult to distinguish. The lazy factor thus avoids mis-segmentation issues (indicated by the blue arrow). Natural face$\copyright$Tee Noir (CC BY).
  • Figure 3: The perceptual evaluation of our method and baselines. For the sake of fairness and to avoid cherry-picked results, we adopt Pixar/cartoon stylization models provided by StyleGANEX and VToonify, and align them with our prompts. The style strength of VToonify is set to $0.7$. Natural face$\copyright$Lizhen Wang et al. (CC BY), and $\copyright$Wojciech Zielonka et al. (CC BY).
  • Figure 4: Visualization of cross-identity driven results. The driving actor is captured in the wild ($\copyright$Obama White House Daily), and the toonified avatar of a different identity is synchronized with the actor's facial expressions and poses.
  • Figure 5: The relationship between the pre-trained and fine-tuned appearance. We list several correspondences: a better pre-trained model results in a more detailed fine-tuned appearance (the number of pre-training iterations varies, but the number of fine-tuning iterations is fixed). The error maps show the per-pixel Euclidean distance in RGB (color values in [0, 255]). A lower mean error (white number) indicates a better pre-trained appearance. Please refer to the arrows for facial details. Natural face$\copyright$Yao Feng et al. (CC BY).
  • ...and 8 more figures
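
The error maps described in the Figure 5 caption (per-pixel Euclidean distance in RGB, with color values in [0, 255]) can be illustrated with a short sketch. This is not the authors' evaluation code: the function names `rgb_error_map` and `mean_error` and the toy images are hypothetical, and the only thing taken from the caption is the metric itself.

```python
import numpy as np


def rgb_error_map(pred, target):
    """Per-pixel Euclidean distance between two H x W x 3 RGB images.

    Both inputs are assumed to hold color values in [0, 255].
    (Hypothetical helper, not from the paper.)
    """
    # Cast to float first so uint8 subtraction cannot wrap around.
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape or pred.shape[-1] != 3:
        raise ValueError("expected two images of identical shape H x W x 3")
    # L2 norm over the channel axis gives one distance per pixel.
    return np.linalg.norm(pred - target, axis=-1)


def mean_error(pred, target):
    """Mean of the error map, comparable to the number printed on each map."""
    return float(rgb_error_map(pred, target).mean())


# Toy example: two 2 x 2 images that differ in a single pixel.
rendered = np.zeros((2, 2, 3), dtype=np.uint8)
reference = np.zeros((2, 2, 3), dtype=np.uint8)
reference[0, 0] = (3, 4, 0)  # distance at this pixel is sqrt(3^2 + 4^2) = 5

print(rgb_error_map(rendered, reference))
print(mean_error(rendered, reference))  # 5 / 4 pixels = 1.25
```

Under this definition each pixel's error lies between 0 and $255\sqrt{3} \approx 441.7$, so a lower mean means the rendered frame is closer to the reference over the whole image.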