High-Fidelity Human Avatars from Laptop Webcams using Edge Compute

Akash Haridas, Imran N. Junejo

TL;DR

This work tackles the problem of producing high-fidelity, animatable human avatars from consumer laptop webcams using on-device edge compute. It presents an end-to-end pipeline that fits FLAME-based 3DMM shapes and textures from multi-view webcam imagery, employs differentiable rendering with loss terms $L_{RGB}$, $L_{identity}$, and $L_{landmark}$, and uses Laplacian pyramid blending to achieve neutral illumination across textures, complemented by a self-supervised texture refinement network and optional face restoration. The approach demonstrates improved lighting neutrality and texture quality, with qualitative Blender renderings and quantitative metrics showing advantages over a close competitor, Avaturn. The work highlights practical impact for privacy-preserving, bandwidth-efficient avatars in video conferencing, gaming, and VR on consumer-grade hardware, enabling real-time, edge-compute avatar generation.
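The three loss terms named above are presumably combined into one fitting objective for the differentiable renderer. A minimal sketch, assuming a weighted sum (the weights $\lambda_{RGB}$, $\lambda_{identity}$, $\lambda_{landmark}$ and the exact combination are hypothetical and not stated in this summary):

$$
L_{total} = \lambda_{RGB}\, L_{RGB} + \lambda_{identity}\, L_{identity} + \lambda_{landmark}\, L_{landmark}
$$

Under this assumption, $L_{RGB}$ would compare rendered and captured pixels, $L_{identity}$ would compare face-recognition embeddings, and $L_{landmark}$ would compare projected and detected facial landmarks.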

Abstract

Photo-realistic human avatars have many applications. High-fidelity avatar generation traditionally required expensive professional camera rigs and artistic labor, but recent research has enabled constructing avatars automatically from smartphones with RGB and IR sensors. These newer methods still rely on the high-resolution cameras found on modern smartphones, and they often offload processing to powerful GPU servers. Modern applications such as video conferencing call for the ability to generate avatars from consumer-grade laptop webcams using the limited compute available on-device. In this work, we develop a novel method based on 3D morphable models, landmark detection, photo-realistic texture GANs, and differentiable rendering to address both low webcam image quality and the constraints of edge computation. We build an automatic system that generates high-fidelity animatable avatars under these limitations, leveraging the neural compute capabilities of mobile chips.

Paper Structure

This paper contains 4 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: High-fidelity renderings of the avatars generated from the multi-view image test sets captured on a laptop webcam.
  • Figure 2: Our end-to-end avatar generation pipeline.
  • Figure 3: The self-supervised pipeline to train a U-Net model to restore a simulated laptop webcam degradation in UV coordinate space. The orange arrows depict the backward propagation of gradients.
  • Figure 4: An illustration of our novel Laplacian pyramid blending procedure. It takes a facial texture generated from a set of unevenly or directionally illuminated source images, and a template facial texture with neutral illumination, and produces an evenly illuminated texture containing unique identifiable features from the source images.
  • Figure 5: Comparison with Avaturn, a closed-source tool with similar computational cost.
  • ...and 1 more figure
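
The Figure 4 caption describes blending a source texture with a neutrally lit template through a Laplacian pyramid. The sketch below shows one way such a blend could work, assuming that the fine-detail bands come from the source texture and the coarse bands (where most illumination variation lives) come from the template. That band split, the function names, and the parameters `levels` and `keep_detail_levels` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# 5-tap binomial kernel, a common approximation of a Gaussian.
KERNEL = np.array([1, 4, 6, 4, 1], dtype=np.float64) / 16.0

def blur(img):
    """Separable blur with edge padding; img is (H, W) or (H, W, C)."""
    out = img.astype(np.float64)
    for axis in (0, 1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (2, 2)
        padded = np.pad(out, pad, mode="edge")
        n = out.shape[axis]
        out = sum(k * np.take(padded, np.arange(i, i + n), axis=axis)
                  for i, k in enumerate(KERNEL))
    return out

def down(img):
    """Blur, then keep every second pixel."""
    return blur(img)[::2, ::2]

def up(img, shape):
    """Upsample by pixel repetition to (H, W) = shape, then blur."""
    big = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return blur(big[:shape[0], :shape[1]])

def laplacian_pyramid(img, levels):
    """Return [finest band, ..., coarsest band, low-frequency residual]."""
    pyr, cur = [], img.astype(np.float64)
    for _ in range(levels):
        nxt = down(cur)
        pyr.append(cur - up(nxt, cur.shape[:2]))
        cur = nxt
    pyr.append(cur)
    return pyr

def collapse(pyr):
    """Invert laplacian_pyramid by adding bands back from coarse to fine."""
    cur = pyr[-1]
    for band in reversed(pyr[:-1]):
        cur = band + up(cur, band.shape[:2])
    return cur

def blend_textures(source, template, levels=4, keep_detail_levels=2):
    """Assumed split: finest bands from source, everything else from template."""
    src = laplacian_pyramid(source, levels)
    tpl = laplacian_pyramid(template, levels)
    mixed = [s if i < keep_detail_levels else t
             for i, (s, t) in enumerate(zip(src, tpl))]
    return collapse(mixed)
```

Calling `blend_textures(source_uv, template_uv)` on two equally sized UV texture arrays returns an array of the same shape. Raising `keep_detail_levels` keeps more of the source's appearance, including more of its lighting; lowering it pulls the result toward the template.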