Towards a Pipeline for Real-Time Visualization of Faces for VR-based Telepresence and Live Broadcasting Utilizing Neural Rendering
Philipp Ladwig, Rene Ebertowski, Alexander Pech, Ralf Dörner, Christian Geiger
TL;DR
This work tackles the challenge of preserving realistic facial cues when faces are obscured by VR headsets by proposing an offline-trained, GAN-based pipeline that converts RGBD data into a frontal 2.5D face representation in real time. The approach conditions a Pix2Pix-like GAN on facial landmark maps to synthesize high-quality RGBD outputs, using multi-scale discriminators and perceptual/feature-matching losses to boost detail while maintaining VR-friendly frame rates. Per-person training with a helmet-mounted RGBD setup yields textured point clouds that can be rendered in VR for telepresence and live broadcasting, achieving notable improvements in SSIM and LPIPS over prior methods, with depth accuracy generally within a few millimeters. While still susceptible to artifacts in highly exaggerated expressions (eye, lip, and oral cavity regions) that can trigger the Uncanny Valley, the method demonstrates a practical, low-cost path toward authentic facial avatars on commodity hardware, with publicly available code and real-time performance on modern GPUs.
Abstract
While head-mounted displays (HMDs) for Virtual Reality (VR) have become widely available in the consumer market, they pose a considerable obstacle to realistic face-to-face conversation in VR because they hide a significant portion of the participants' faces. Even with image streams from cameras directly attached to an HMD, stitching together a convincing image of an entire face remains a challenging task because of extreme capture angles and strong lens distortions due to a wide field of view. Compared to the long line of research in VR, reconstruction of faces hidden beneath an HMD is a very recent research topic. While the current state-of-the-art solutions demonstrate photo-realistic 3D reconstruction results, they require high-cost laboratory equipment and incur high computational costs. We present an approach that focuses on low-cost hardware and can be used on a commodity gaming computer with a single GPU. We leverage the benefits of an end-to-end pipeline by means of a Generative Adversarial Network (GAN). Our GAN produces a frontal-facing 2.5D point cloud based on a training dataset captured with an RGBD camera. In our approach, the training process is offline, while the reconstruction runs in real time. Our results show adequate reconstruction quality within the 'learned' expressions. Expressions not learned by the network produce artifacts and can trigger the Uncanny Valley effect.
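
The abstract describes the network output as a frontal-facing 2.5D point cloud derived from RGBD data. To illustrate how a single RGBD frame can be turned into such a textured point cloud, the following minimal Python sketch back-projects every valid depth pixel into 3D. It is not taken from the authors' code: the function name `backproject_rgbd`, the pinhole camera model, the intrinsics `fx`, `fy`, `cx`, `cy`, the millimeter depth unit, and the convention that zero marks a missing measurement are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# One textured point: position in meters plus an 8-bit RGB color.
Point = Tuple[float, float, float, int, int, int]


def backproject_rgbd(
    depth_mm: Sequence[Sequence[float]],
    rgb: Sequence[Sequence[Tuple[int, int, int]]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project an aligned depth map and color image into a point cloud.

    Assumes a pinhole camera (hypothetical intrinsics fx, fy, cx, cy),
    depth values in millimeters, and 0 as the marker for missing depth.
    """
    points: List[Point] = []
    for v, (depth_row, rgb_row) in enumerate(zip(depth_mm, rgb)):
        for u, (z_mm, color) in enumerate(zip(depth_row, rgb_row)):
            if z_mm <= 0:
                # Skip pixels without a depth measurement.
                continue
            z = z_mm / 1000.0          # millimeters to meters
            x = (u - cx) * z / fx      # pinhole back-projection
            y = (v - cy) * z / fy
            r, g, b = color
            points.append((x, y, z, r, g, b))
    return points
```

Because each pixel yields at most one 3D point seen from a single viewpoint, the result is a 2.5D surface rather than a full 3D model, which matches the frontal-facing representation the abstract refers to. In a real-time renderer this loop would typically run on the GPU rather than in Python.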
