OFER: Occluded Face Expression Reconstruction

Pratheba Selvaraju, Victoria Fernandez Abrevaya, Timo Bolkart, Rick Akkerman, Tianyu Ding, Faezeh Amjadi, Ilya Zharkov

TL;DR

OFER addresses the challenge of reconstructing 3D faces from a single occluded image by generating a distribution of plausible shape and expression hypotheses via two conditional diffusion models on FLAME parameters; a novel identity-ranking mechanism selects a consistent shape while ExpGen provides diverse expressions. The method enables multi-hypothesis reasoning under occlusion and introduces CO-545, a protocol to evaluate expressive 3D reconstructions in occluded scenarios. On benchmarks like NoW and CO-545, OFER achieves improved accuracy and richer expression diversity compared to state-of-the-art occlusion-focused methods, while providing a principled selection of identity through ranking. Overall, OFER combines diffusion-based generative modeling with a learning-to-rank framework to produce plausible, diverse, and identity-consistent 3D face reconstructions from a single occluded image, offering practical utility for avatars and telepresence.

Abstract

Reconstructing 3D face models from a single image is an inherently ill-posed problem, which becomes even more challenging in the presence of occlusions. In addition to fewer available observations, occlusions introduce an extra source of ambiguity where multiple reconstructions can be equally valid. Despite the ubiquity of the problem, very few methods address its multi-hypothesis nature. In this paper we introduce OFER, a novel approach for single-image 3D face reconstruction that can generate plausible, diverse, and expressive 3D faces, even under strong occlusions. Specifically, we train two diffusion models to generate the shape and expression coefficients of a face parametric model, conditioned on the input image. This approach captures the multi-modal nature of the problem, generating a distribution of solutions as output. However, to maintain consistency across diverse expressions, the challenge is to select the best matching shape. To achieve this, we propose a novel ranking mechanism that sorts the outputs of the shape diffusion network based on predicted shape accuracy scores. We evaluate our method using standard benchmarks and introduce CO-545, a new protocol and dataset designed to assess the accuracy of expressive faces under occlusion. Our results show improved performance over occlusion-based methods, while also enabling the generation of diverse expressions for a given image.
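The sample, rank, and combine flow described in the abstract can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the three helper functions are hypothetical stand-ins (random draws and a norm-based score) for the trained shape diffusion model, the ranking network, and the expression diffusion model. The dimensions 300 and 50 are taken from the Figure 3 caption.

```python
import numpy as np

SHAPE_DIM, EXPR_DIM = 300, 50  # FLAME coefficient sizes from the Figure 3 caption


def sample_shapes(rng, n):
    """Hypothetical stand-in for the shape diffusion model: n shape hypotheses."""
    return rng.standard_normal((n, SHAPE_DIM))


def sample_expressions(rng, n):
    """Hypothetical stand-in for the expression diffusion model."""
    return rng.standard_normal((n, EXPR_DIM))


def score_shapes(shapes):
    """Hypothetical stand-in for the ranking network.

    Higher score means "predicted to be more accurate". The norm-based
    rule here is a placeholder, not a learned score.
    """
    return -np.linalg.norm(shapes, axis=1)


def reconstruct(rng, n=8):
    shapes = sample_shapes(rng, n)            # 1. draw n shape hypotheses
    best = shapes[np.argmax(score_shapes(shapes))]  # 2. keep the top-ranked shape
    exprs = sample_expressions(rng, n)        # 3. draw n expression hypotheses
    # 4. pair the single selected shape with every expression sample
    return np.concatenate([np.tile(best, (n, 1)), exprs], axis=1)


hypotheses = reconstruct(np.random.default_rng(0), n=8)
```

Each row of `hypotheses` holds the same shape coefficients followed by a different expression sample, which mirrors the stated goal of a consistent identity across diverse expressions.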

Paper Structure

This paper contains 48 sections, 7 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Comparison against expression reconstruction methods. We show results from EMOCA [danecek2022emoca] (pink); samples generated by Diverse3D [dey2022generating] (blue); and samples generated by our method (green). EMOCA can only reconstruct a single solution due to its deterministic nature, while Diverse3D shows non-plausible faces. OFER (our method) generates diverse 3D faces with plausible expressions.
  • Figure 2: Overview of OFER. Given an input image, the Identity Generative Network (IdGen) samples N shape parameters. The reconstructed shapes are then passed to the Identity Ranking Network (IdRank) to select a unique identity. Finally, the Expression Generative Network (ExpGen) generates N expression parameters, which are combined with the selected shape to output diverse and expressive face reconstructions (bottom row).
  • Figure 3: Overview of the identity and expression generative networks (IdGen, in blue, and ExpGen, in green). For IdGen, the input image is encoded into a 512-dimensional embedding using ArcFace [Deng2018ArcFaceAA]. This serves as a condition for the 1D U-Net diffusion network, which is trained to denoise 300-dimensional noise into FLAME shape coefficients, S. For ExpGen, the input image is encoded into a 1024-dimensional embedding using the FaRL [zheng2022general] and ArcFace [Deng2018ArcFaceAA] encoders. The embedding serves as a condition for the 1D U-Net diffusion network, which is trained to denoise 50-dimensional noise into FLAME expression coefficients, E.
  • Figure 4: Identity Ranking Network. Given the N shape coefficients from IdGen, we reconstruct the neutral meshes using FLAME. Each mesh is passed through a 5-layer MLP to compute a score, conditioned on the input image using the ArcFace [Deng2018ArcFaceAA] and FaRL [zheng2022general] encoders. The N scores are then converted into probabilities using softmax. The ranking order of the sorted scores is compared against the ranking of the sorted reconstructed errors, and the network is trained to match them.
  • Figure 5: Neutral face reconstruction on occluded images. For (a) the given occluded input image, (b) shows the reconstructed shape provided by MICA [zielonka2022mica]; (c) is one of the generated samples from our method; (d) and (e) are the best and worst-ranked samples, respectively, as selected by the ranking network.
  • ...and 9 more figures
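
The Figure 4 caption says the N scores go through a softmax and that the network is trained so the score ranking matches the ranking of reconstruction errors, but it does not name the loss. The sketch below uses a ListNet-style cross-entropy as one common listwise choice. That choice is an assumption made here for illustration, as are the function names and the toy numbers.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def listwise_rank_loss(scores, errors):
    """ListNet-style cross-entropy (assumed loss, not taken from the source).

    The target distribution puts more mass on samples with lower
    reconstruction error; the predicted distribution is softmax(scores).
    """
    pred = softmax(scores)
    target = softmax(-errors)
    return float(-(target * np.log(pred + 1e-12)).sum())


def ranking_agreement(scores, errors):
    """True when sorting by descending score equals sorting by ascending error."""
    return bool(np.array_equal(np.argsort(-scores), np.argsort(errors)))


# Toy example with N = 4 shape samples (made-up values).
errors = np.array([0.9, 0.2, 0.5, 1.4])        # per-sample reconstruction error
good_scores = np.array([0.1, 2.0, 1.0, -1.0])  # ordering agrees with the errors
bad_scores = good_scores[::-1].copy()          # ordering disagrees
```

With these toy values, `ranking_agreement` holds for `good_scores` and fails for `bad_scores`, and `listwise_rank_loss` is lower for the score vector whose order matches the errors, which is the behaviour a ranking objective of this kind is meant to reward.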