Efficient Label Refinement for Face Parsing Under Extreme Poses Using 3D Gaussian Splatting
Ankit Gahlawat, Anirban Mukherjee, Dinesh Babu Jayagopi
TL;DR
The paper tackles face parsing under extreme poses by turning noisy multiview predictions into high-quality, pose-diverse training labels with a dual 3D Gaussian Splatting (3DGS) framework. Two synchronized 3DGS models, one fitted to the RGB images and one to their initial segmentation maps, share geometry so that the refined segmentation maps are consistent across views; the models are then rendered and the results clustered to form an auxiliary dataset. A lightweight manual post-processing step cleans the masks, after which a baseline parser (BiSeNet) is fine-tuned on them, improving accuracy on extreme poses while preserving frontal performance. Experiments on FaceScape with a small identity set show significant gains in mIoU and F1 and competitive perceptual scores in human studies, without ground-truth 3D annotations. The method is model-agnostic and scalable, offering a practical route to robust face parsing in real-world, pose-diverse scenarios.
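As a reference point for the mIoU and F1 figures mentioned above, the following is a minimal sketch of how these metrics are commonly computed for label maps. The averaging protocol used in the paper (per image or over the whole dataset, and how absent classes are handled) is not stated in this section, so the choices below, along with the function name `mean_iou_and_f1`, are assumptions for illustration only.

```python
import numpy as np

def mean_iou_and_f1(pred, target, num_classes):
    """Class-averaged IoU and F1 for two integer label maps of equal shape.

    Assumption: classes absent from both maps are skipped rather than
    counted as perfect or zero scores.
    """
    ious, f1s = [], []
    for c in range(num_classes):
        p, t = pred == c, target == c
        tp = np.logical_and(p, t).sum()    # pixels correctly labeled c
        fp = np.logical_and(p, ~t).sum()   # labeled c but should not be
        fn = np.logical_and(~p, t).sum()   # should be c but labeled otherwise
        if tp + fp + fn == 0:
            continue                       # class c appears in neither map
        ious.append(tp / (tp + fp + fn))
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(ious)), float(np.mean(f1s))

# Tiny hand-checkable example with two present classes out of three.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
miou, mf1 = mean_iou_and_f1(pred, target, num_classes=3)
```

In this example class 0 has IoU 1/2 and F1 2/3, class 1 has IoU 2/3 and F1 4/5, and class 2 is skipped, so the averages are 7/12 and 11/15.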
Abstract
Accurate face parsing under extreme viewing angles remains a significant challenge due to limited labeled data in such poses. Manual annotation is costly and often impractical at scale. We propose a novel label refinement pipeline that leverages 3D Gaussian Splatting (3DGS) to generate accurate segmentation masks from noisy multiview predictions. By jointly fitting two 3DGS models, one to RGB images and one to their initial segmentation maps, our method enforces multiview consistency through shared geometry, enabling the synthesis of pose-diverse training data with only minimal post-processing. Fine-tuning a face parsing model on this refined dataset significantly improves accuracy on challenging head poses, while maintaining strong performance on standard views. Extensive experiments, including human evaluations, demonstrate that our approach achieves superior results compared to state-of-the-art methods, despite requiring no ground-truth 3D annotations and using only a small set of initial images. Our method offers a scalable and effective solution for improving face parsing robustness in real-world settings.
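To illustrate why shared geometry enforces consistency between the RGB and segmentation renders, here is a toy sketch, not the authors' implementation. It assumes a simplified 2D setting with isotropic Gaussians and NumPy instead of a real 3DGS rasterizer, and it assumes the segmentation model stores per-Gaussian class scores; all names (`compositing_weights`, `render`, `seg_scores`) are hypothetical. The point it shows is that the per-pixel blending weights depend only on geometry, so both outputs are composited from exactly the same Gaussians.

```python
import numpy as np

def compositing_weights(means, scales, opacities, depths, height, width):
    """Front-to-back alpha-compositing weights, shape (H, W, N).

    Depends only on the shared geometry (position, scale, opacity, depth).
    """
    order = np.argsort(depths)                        # nearest Gaussian first
    ys, xs = np.mgrid[0:height, 0:width]
    pix = np.stack([xs, ys], axis=-1).astype(float)   # (H, W, 2) pixel centers
    diff = pix[:, :, None, :] - means[order][None, None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)                # (H, W, N)
    alpha = opacities[order] * np.exp(-0.5 * sq_dist / scales[order] ** 2)
    # Transmittance reaching each Gaussian: product of (1 - alpha) in front of it.
    trans = np.cumprod(1.0 - alpha, axis=-1)
    trans = np.concatenate([np.ones_like(trans[..., :1]), trans[..., :-1]], axis=-1)
    weights_sorted = alpha * trans
    weights = np.empty_like(weights_sorted)
    weights[..., order] = weights_sorted              # back to original indexing
    return weights

def render(weights, attributes):
    """Blend per-Gaussian attributes (N, C) into an image (H, W, C)."""
    return weights @ attributes

rng = np.random.default_rng(0)
n, num_classes, size = 50, 4, 32
# Shared geometry, used by both models.
means = rng.uniform(0, size, size=(n, 2))
scales = rng.uniform(1.0, 3.0, size=n)
opacities = rng.uniform(0.2, 0.9, size=n)
depths = rng.uniform(1.0, 5.0, size=n)
# Model-specific appearance: colors for one model, class scores for the other.
rgb = rng.uniform(0, 1, size=(n, 3))
seg_scores = rng.dirichlet(np.ones(num_classes), size=n)

w = compositing_weights(means, scales, opacities, depths, size, size)
rgb_image = render(w, rgb)                 # (32, 32, 3)
seg_image = render(w, seg_scores)          # (32, 32, 4)
label_map = seg_image.argmax(axis=-1)      # (32, 32) refined labels
```

Because `w` is computed once and reused, any change to the geometry moves the RGB and segmentation renders together. In a real pipeline the geometry would be optimized against multiview images and the weights recomputed per camera.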
