3D Face Reconstruction From Radar Images

Valentin Braeutigam, Vanessa Wirth, Ingrid Ullmann, Christian Schüßler, Martin Vossiek, Matthias Berking, Bernhard Egger

TL;DR

The paper tackles 3D face reconstruction from mmWave radar data, addressing limitations of optical methods in environments like sleep labs. It introduces a model-based approach that first uses a CNN encoder to predict 3DMM parameters from synthetic radar images, then augments this with a learned, differentiable radar renderer to form a model-based autoencoder for Analysis-by-Synthesis. The approach enables unsupervised test-time fine-tuning on the image reconstruction loss and demonstrates that the autoencoder improves shape and expression recovery over a purely supervised encoder, with depth information helping shape accuracy. The work enables differentiable radar-based face reconstruction with practical implications for bedside monitoring and privacy-conscious applications, while highlighting the domain gap between synthetic and real data and calling for broader real-data collections and improved reflectance modeling.

Abstract

The 3D reconstruction of faces has gained wide attention in computer vision and is used in many fields of application, for example, animation, virtual reality, and even forensics. This work is motivated by monitoring patients in sleep laboratories. Due to their unique characteristics, sensors from the radar domain have advantages compared to optical sensors, namely penetration of electrically non-conductive materials and independence of light. These advantages of radar signals unlock new applications and require adaptation of 3D reconstruction frameworks. We propose a novel model-based method for 3D reconstruction from radar images. We generate a dataset of synthetic radar images with a physics-based but non-differentiable radar renderer. This dataset is used to train a CNN-based encoder to estimate the parameters of a 3D morphable face model. Whilst the encoder alone already leads to strong reconstructions of synthetic data, we extend our reconstruction in an Analysis-by-Synthesis fashion to a model-based autoencoder. This is enabled by learning the rendering process in the decoder, which acts as an object-specific differentiable radar renderer. Subsequently, the combination of both network parts is trained to minimize both the loss of the parameters and the loss of the resulting reconstructed radar image. This has the additional benefit that, at test time, the parameters can be further optimized by fine-tuning the autoencoder unsupervised on the image loss. We evaluated our framework on generated synthetic face images as well as on real radar images with 3D ground truth of four individuals.
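
The two objectives described in the abstract, a combined parameter and image loss during training and an image-only loss at test time, can be illustrated with a toy sketch. It is not the authors' implementation: a small fixed linear map stands in for the learned radar renderer, the face model has only two parameters, and the sketch keeps the decoder fixed and updates the parameters directly rather than fine-tuning network weights. All names (`render`, `image_loss`, `combined_loss`, `refine`, `weight`) are hypothetical.

```python
# Toy sketch: a fixed linear map stands in for the learned radar renderer.
# All names and values are hypothetical and not taken from the paper.

def render(decoder, params):
    """Frozen 'renderer': maps model parameters to a flat image vector."""
    return [sum(w * p for w, p in zip(row, params)) for row in decoder]

def image_loss(decoder, params, target):
    """Mean squared error between the rendered and the observed image."""
    rendered = render(decoder, params)
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(target)

def combined_loss(decoder, params, target, true_params, weight=1.0):
    """Training-time objective: parameter loss plus weighted image loss."""
    param_loss = sum((p - q) ** 2 for p, q in zip(params, true_params)) / len(params)
    return param_loss + weight * image_loss(decoder, params, target)

def refine(decoder, params, target, lr=0.1, steps=200):
    """Test-time refinement: gradient descent on the image loss only."""
    params = list(params)
    n = len(target)
    for _ in range(steps):
        residual = [r - t for r, t in zip(render(decoder, params), target)]
        # Gradient of the MSE w.r.t. parameter j: (2/n) * sum_i D[i][j] * residual[i]
        grad = [2.0 / n * sum(decoder[i][j] * residual[i] for i in range(n))
                for j in range(len(params))]
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3-pixel "image", 2 parameters
true_params = [0.5, -0.25]
observed = render(decoder, true_params)         # no ground-truth parameters needed below
initial = [0.4, -0.1]                           # stand-in for the encoder's estimate
refined = refine(decoder, initial, observed)
```

The refinement step needs only the observed image, which is why it can run unsupervised at test time; the combined loss, by contrast, requires ground-truth parameters and is therefore limited to the synthetic training data.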

Paper Structure

This paper contains 13 sections, 1 equation, 17 figures, 5 tables.

Figures (17)

  • Figure 1: The real radar setup and RGB cameras for photogrammetry. The radar module consists of 94 transmitter antennas and 94 receiver antennas in a square-shaped placement. Around the radar module five cameras are positioned to additionally reconstruct the captured face via photogrammetry. Four persons were captured in this setup, with each showing five different facial expressions.
  • Figure 2: Examples of real radar images (left) and synthetic-real radar images (right). The amplitude images have a dynamic range of -20 dB. The synthetic-real images are generated from the mesh of the same person reconstructed via photogrammetry.
  • Figure 3: Examples of a synthetic amplitude image (left) with a dynamic range of -20 dB and a synthetic depth image in comparison (right).
  • Figure 4: Overview of our method. The input image is fed to three encoder networks which predict the shape, expression, and pose of the face. These parameters are then fed to a differentiable renderer that reconstructs the input image. The encoder consists of two ResNet-50 models for predicting shape and expression and an AlexNet model for predicting the pose. The differentiable renderer is a ResNet-50 model whose layers are arranged in reverse order. During training, both the parameter loss and the image loss are applied. For inference, the encoder and decoder are frozen and only the image loss is optimized, yielding the face model parameters that describe the 3D face reconstruction.
  • Figure 5: Cosine similarity comparison between the face model parameters computed by the autoencoder evaluated on synthetic data input with a uniformly sampled pose (top) and a neutral pose (bottom).
  • ...and 12 more figures
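
The cosine similarity used in Figure 5 to compare face model parameters can be sketched as follows. This is a generic, minimal illustration of the metric, not the authors' evaluation code; the function name and the example vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long parameter vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical predicted and ground-truth shape parameters
predicted = [0.8, -0.3, 0.1]
ground_truth = [1.0, -0.5, 0.0]
similarity = cosine_similarity(predicted, ground_truth)
```

A value near 1 means the predicted parameter vector points in the same direction as the ground truth, regardless of its magnitude; a value near 0 means the two are unrelated.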