Towards a Pipeline for Real-Time Visualization of Faces for VR-based Telepresence and Live Broadcasting Utilizing Neural Rendering

Philipp Ladwig, Rene Ebertowski, Alexander Pech, Ralf Dörner, Christian Geiger

TL;DR

This work tackles the challenge of preserving realistic facial cues when faces are obscured by VR headsets by proposing an offline-trained, GAN-based pipeline that converts RGBD data into a frontal 2.5D face representation in real time. The approach conditions a Pix2Pix-like GAN on facial landmark maps to synthesize high-quality RGBD outputs, using multi-scale discriminators and perceptual/feature-matching losses to boost detail while maintaining VR-friendly frame rates. Per-person training with a helmet-mounted RGBD setup yields textured point clouds that can be rendered in VR for telepresence and live broadcasting, achieving notable improvements in SSIM and LPIPS over prior methods, with depth accuracy generally within a few millimeters. While still susceptible to artifacts in highly exaggerated expressions (eye, lip, and oral cavity regions) that can trigger the Uncanny Valley, the method demonstrates a practical, low-cost path toward authentic facial avatars on commodity hardware, with publicly available code and real-time performance on modern GPUs.

Abstract

While head-mounted displays (HMDs) for Virtual Reality (VR) have become widely available in the consumer market, they pose a considerable obstacle to realistic face-to-face conversation in VR because they hide a significant portion of the participants' faces. Even with image streams from cameras directly attached to an HMD, stitching together a convincing image of an entire face remains a challenging task because of extreme capture angles and strong lens distortions due to a wide field of view. Compared to the long line of research in VR, reconstruction of faces hidden beneath an HMD is a very recent topic of research. While the current state-of-the-art solutions demonstrate photo-realistic 3D reconstruction results, they require high-cost laboratory equipment and incur large computational costs. We present an approach that focuses on low-cost hardware and can be used on a commodity gaming computer with a single GPU. We leverage the benefits of an end-to-end pipeline by means of Generative Adversarial Networks (GANs). Our GAN produces a frontal-facing 2.5D point cloud based on a training dataset captured with an RGBD camera. In our approach, the training process is offline, while the reconstruction runs in real time. Our results show adequate reconstruction quality within the 'learned' expressions. Expressions not learned by the network produce artifacts and can trigger the Uncanny Valley effect.
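To illustrate what a 2.5D point cloud derived from an RGBD image means in practice, the following is a minimal sketch of back-projecting a depth image into textured 3D points. It assumes a simple pinhole camera model, depth in meters, and a depth of 0 marking invalid pixels; the function name `rgbd_to_point_cloud` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the paper, whose actual implementation may differ.

```python
import numpy as np

def rgbd_to_point_cloud(rgb, depth, fx, fy, cx, cy):
    """Back-project an RGBD image into a textured point cloud.

    rgb:   (H, W, 3) color image
    depth: (H, W) depth in meters; 0 marks invalid pixels (assumption)
    fx, fy, cx, cy: pinhole intrinsics (hypothetical, not from the paper)
    Returns (N, 3) points and (N, 3) colors for the valid pixels.
    """
    h, w = depth.shape
    # Pixel coordinate grids: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    # Pinhole back-projection of each valid pixel.
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)
    colors = rgb[valid]
    return points, colors

# Tiny synthetic example: a flat 4x4 depth plane with one invalid pixel.
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
depth = np.full((4, 4), 0.5)
depth[0, 0] = 0.0
points, colors = rgbd_to_point_cloud(rgb, depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
```

Because every pixel yields at most one point along its viewing ray, the result covers only the surface visible from the camera, which is why such a representation is called 2.5D rather than full 3D.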

Paper Structure

This paper contains 11 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Our conceptual pipeline: First, we capture several RGBD images with a helmet camera mount. These images are processed and serve as the input data for our GAN. After training, the GAN produces textured point clouds in real time. In this work, we improve the dataset processing, training, and inference stages compared to our previous systems [Ladwig2020AufDemWeg, Ladwig2020Unmasking]. Building a face-tracking HMD is not part of the present work.
  • Figure 2: Helmet camera mount for RGBD data acquisition. This mount ensures that head rotations are not included in the training dataset and, therefore, reduces the entropy in the dataset. Moreover, this method significantly reduces the training time and increases the visual quality of the output images. The material price of the helmet mount without RGBD camera is about 60 USD.
  • Figure 3: Example convolution for the discriminator input. The RGBD channels and the facial landmark map (FLM) are each weighted individually.
  • Figure 4: Our new pipeline, network architecture, and losses significantly improve the quality. Image a) shows a sample from the previous system of Ladwig et al. [Ladwig2020AufDemWeg, Ladwig2020Unmasking]. Image b) illustrates the enhanced resolution (from $256 \times 256$ to $512 \times 512$ pixels) and the improved preservation of high-frequency details.
  • Figure 5: Results 1/2: Column 1 shows FLMs from our evaluation datasets, so the results are based on data unseen by the neural network. The FLMs were created from the images in column 3, which are real images from the evaluation dataset. Column 2 shows the images our GAN generated from the FLMs in column 1. Column 4 depicts the SSIM difference; darker values indicate larger differences between the images in columns 2 and 3. Column 6 visualizes the error between the generated depth and the ground truth depth. Columns 7 and 8 show the combined generated depth and color data from angles of 30 and 90 degrees.
  • ...and 1 more figure
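
The depth error visualization described for Figure 5 compares the generated depth with the ground truth depth. A minimal sketch of such a comparison follows, assuming both depth maps are in millimeters, share the same resolution, and use 0 for invalid pixels. The function name `depth_error_map` and the sample values are hypothetical and do not reproduce the paper's evaluation code or results.

```python
import numpy as np

def depth_error_map(generated, ground_truth):
    """Per-pixel absolute depth error and its mean over valid pixels.

    Both inputs are (H, W) depth maps in millimeters; a value of 0 is
    treated as 'no measurement' (an assumption, not from the paper).
    """
    valid = (generated > 0) & (ground_truth > 0)
    error = np.zeros_like(ground_truth, dtype=np.float64)
    error[valid] = np.abs(generated[valid] - ground_truth[valid])
    # Mean error only over pixels where both maps have a measurement.
    mean_error = float(error[valid].mean()) if valid.any() else float("nan")
    return error, mean_error

# Made-up values for illustration only.
gt = np.full((2, 2), 500.0)
gen = np.array([[502.0, 499.0], [500.0, 0.0]])
error_map, mean_mm = depth_error_map(gen, gt)
```

Masking invalid pixels before averaging keeps holes in either depth map from inflating or deflating the reported error.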