OFER: Occluded Face Expression Reconstruction
Pratheba Selvaraju, Victoria Fernandez Abrevaya, Timo Bolkart, Rick Akkerman, Tianyu Ding, Faezeh Amjadi, Ilya Zharkov
TL;DR
OFER addresses the challenge of reconstructing 3D faces from a single occluded image by generating a distribution of plausible shape and expression hypotheses with two conditional diffusion models over FLAME parameters. A novel identity-ranking mechanism selects a single consistent shape, while the expression model (ExpGen) provides diverse expressions. The method enables multi-hypothesis reasoning under occlusion and introduces CO-545, a protocol and dataset for evaluating expressive 3D reconstructions in occluded scenarios. On benchmarks such as NoW and CO-545, OFER achieves improved accuracy and richer expression diversity compared to state-of-the-art occlusion-focused methods. By combining diffusion-based generative modeling with a learning-to-rank framework, it produces plausible, diverse, and identity-consistent 3D face reconstructions, offering practical utility for avatars and telepresence.
Abstract
Reconstructing 3D face models from a single image is an inherently ill-posed problem, which becomes even more challenging in the presence of occlusions. In addition to fewer available observations, occlusions introduce an extra source of ambiguity where multiple reconstructions can be equally valid. Despite the ubiquity of the problem, very few methods address its multi-hypothesis nature. In this paper, we introduce OFER, a novel approach for single-image 3D face reconstruction that can generate plausible, diverse, and expressive 3D faces, even under strong occlusions. Specifically, we train two diffusion models to generate the shape and expression coefficients of a face parametric model, conditioned on the input image. This approach captures the multi-modal nature of the problem, generating a distribution of solutions as output. However, keeping the identity consistent across these diverse expressions requires selecting the best-matching shape. To achieve this, we propose a novel ranking mechanism that sorts the outputs of the shape diffusion network based on predicted shape accuracy scores. We evaluate our method using standard benchmarks and introduce CO-545, a new protocol and dataset designed to assess the accuracy of expressive faces under occlusion. Our results show improved performance over occlusion-based methods, while also enabling the generation of diverse expressions for a given image.
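The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: random arrays stand in for the outputs of the two diffusion models and for the ranking network's predicted scores. The function names, the coefficient sizes (300 shape, 100 expression), and the convention that a higher score means a more accurate shape are all assumptions made for illustration.

```python
import numpy as np

def rank_shapes(shape_samples, scores):
    """Sort shape hypotheses by predicted accuracy score, best first.

    Assumption: a higher score means a more accurate shape.
    """
    order = np.argsort(-scores, kind="stable")
    return shape_samples[order], order

def reconstruct(shape_samples, scores, expr_samples):
    """Pair the top-ranked shape with every sampled expression."""
    ranked, order = rank_shapes(shape_samples, scores)
    best_shape = ranked[0]
    # Repeat the single selected shape once per expression hypothesis.
    tiled = np.broadcast_to(best_shape, (expr_samples.shape[0], best_shape.shape[0]))
    # Each row is one hypothesis: [shape coefficients | expression coefficients].
    return np.concatenate([tiled, expr_samples], axis=1), int(order[0])

# Stand-ins for model outputs (hypothetical sizes and values).
rng = np.random.default_rng(0)
shape_samples = rng.normal(size=(5, 300))        # 5 shape hypotheses
scores = np.array([0.1, 0.7, 0.3, 0.9, 0.2])     # predicted accuracy per hypothesis
expr_samples = rng.normal(size=(4, 100))         # 4 expression hypotheses

hypotheses, best_idx = reconstruct(shape_samples, scores, expr_samples)
print(hypotheses.shape, best_idx)  # (4, 400) 3
```

Every output row shares the same shape coefficients, so identity stays fixed while the expression varies across hypotheses.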
