MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices

Jianwen Jiang, Gaojie Lin, Zhengkun Rong, Chao Liang, Yongming Zhu, Jiaqi Yang, Tianyun Zhong

TL;DR

MobilePortrait tackles real-time, on-device neural head avatars by fusing explicit and implicit facial motion cues with precomputed appearance information. The approach feeds mixed keypoints and pseudo multiview and background features into lightweight U-Nets, reaching quality comparable to state-of-the-art methods at less than one-tenth of the computational cost and running at over 100 FPS on mobile devices. It introduces facial-knowledge losses and appearance-knowledge augmentation to strengthen motion modeling and image synthesis while remaining efficient. The method supports both video- and audio-driven inputs and is demonstrated across varied identities, motions, and deployment scenarios.

Abstract

Existing neural head avatar methods have achieved significant progress in the image quality and motion range of portrait animation. However, these methods neglect computational overhead, and to the best of our knowledge, none is designed to run on mobile devices. This paper presents MobilePortrait, a lightweight one-shot neural head avatar method that reduces learning complexity by integrating external knowledge into both motion modeling and image synthesis, enabling real-time inference on mobile devices. Specifically, we introduce a mixed representation of explicit and implicit keypoints for precise motion modeling and precomputed visual features for enhanced foreground and background synthesis. With these two key designs and using simple U-Nets as backbones, our method achieves state-of-the-art performance with less than one-tenth of the computational demand. It has been validated to reach speeds of over 100 FPS on mobile devices and supports both video- and audio-driven inputs.

Paper Structure (13 sections, 1 equation, 5 figures, 6 tables)

This paper contains 13 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The examples on the left demonstrate that our method can achieve results comparable to or even better than those of current high-computation state-of-the-art methods, with less than one-tenth of the computational cost. On the right, a bubble chart compares various methods, with the size of each bubble representing the model's parameter count. This further confirms that our method can produce high-quality results while offering a significant advantage in computational efficiency.
  • Figure 2: The video-driven pipeline of MobilePortrait. MobilePortrait processes the source and driving images to generate mixed keypoints, which merge the detected neural and facial keypoints. These mixed keypoints, along with precomputed source masks, are passed to a dense motion network that produces the optical flow used for image warping. The synthesis network generates the final image by combining the warped image with precomputed pseudo background and multiview foreground features. Since facial and appearance knowledge is precomputed only once, the two simple U-Net backbones account for nearly all of the computational load during inference. In audio-driven mode, an audio-to-keypoints module supplies the driving keypoints.
  • Figure 3: The motion generation process of MobilePortrait. (a) is the optical flow generation method adopted by MobilePortrait, where NK and FK denote the neural and facial keypoints, respectively. (b) is the method used in prior work (siarohin2019fomm, zhao2022tps, hong2023mcn, siarohin2021mraa). (c) is similar to zhang2023metaportrait, which obtains optical flow directly through a CNN. For brevity, we omit the heatmap generation and occlusion process.
  • Figure 4: Visual comparisons among models with different FLOPs.
  • Figure 5: Visualization Results. To compare with other methods visually, we selected various styles of input images and rich motions to demonstrate the robustness of MobilePortrait (more results are shown on the project page). The video results are provided in the supplementary materials.
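
The Figure 2 caption describes merging neural and facial keypoints and then warping the source image with an optical flow field. The following is a minimal sketch of those two steps only, under stated assumptions: the function names (`mix_keypoints`, `bilinear_sample`, `warp`) are hypothetical, the merge is assumed to be a plain concatenation, and the flow is hand-written rather than predicted by a dense motion network. It is not the authors' implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Image = List[List[float]]  # grayscale, row-major: image[y][x]


def mix_keypoints(neural: List[Point], facial: List[Point]) -> List[Point]:
    """Merge implicit (neural) and explicit (facial) keypoints into one set.

    Assumption: plain concatenation; the paper's actual merging rule may differ.
    """
    return list(neural) + list(facial)


def bilinear_sample(image: Image, x: float, y: float) -> float:
    """Sample the image at a fractional (x, y), clamping to the border."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp(source: Image, flow: List[List[Point]]) -> Image:
    """Backward warp: output[y][x] reads the source at (x + dx, y + dy)."""
    return [
        [bilinear_sample(source, x + dx, y + dy) for x, (dx, dy) in enumerate(row)]
        for y, row in enumerate(flow)
    ]


source = [[0.0, 1.0, 2.0],
          [3.0, 4.0, 5.0],
          [6.0, 7.0, 8.0]]
# Hand-made constant flow of (+1, 0): each output pixel reads its right-hand
# neighbour, and the last column is clamped to the image border.
flow = [[(1.0, 0.0)] * 3 for _ in range(3)]
warped = warp(source, flow)
```

In the pipeline described above, the flow would come from the dense motion network and the warped result would then be passed to the synthesis network together with the precomputed features; this sketch covers neither network.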