Efficient Label Refinement for Face Parsing Under Extreme Poses Using 3D Gaussian Splatting

Ankit Gahlawat, Anirban Mukherjee, Dinesh Babu Jayagopi

TL;DR

The paper tackles face parsing under extreme head poses by turning noisy multiview predictions into high-quality, pose-diverse training labels using a dual 3D Gaussian Splatting (3DGS) framework with shared geometry. It fits two synchronized 3DGS models, one to the RGB images and one to the initial segmentation maps, so that the shared geometry enforces cross-view consistency; the resulting segmentation renderings are then clustered to form an auxiliary dataset. A lightweight manual post-processing step cleans the masks, after which a baseline parser (BiSeNet) is fine-tuned on them, improving accuracy on extreme poses while preserving frontal performance. Experiments on FaceScape with a small set of identities show significant gains in mIoU and F1, and human studies report competitive perceptual scores. The method requires no ground-truth 3D annotations and is model-agnostic, offering a practical, scalable route to robust face parsing in real-world, pose-diverse scenarios.
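For readers unfamiliar with the two metrics named above, the following is a minimal sketch of per-class IoU and F1 over integer label maps, averaged into mIoU and mean F1. It uses the standard definitions; the paper's exact evaluation protocol (for example, which classes are averaged or ignored) is not stated in this summary, so the function names and the choice to skip absent classes are assumptions made for illustration.

```python
from collections import Counter


def per_class_iou_f1(pred, target, num_classes):
    """Compute per-class IoU and F1 from flat sequences of integer class ids."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for p, t in zip(pred, target):
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p where it should not be
            fn[t] += 1  # missed a pixel of true class t
    ious, f1s = {}, {}
    for c in range(num_classes):
        union = tp[c] + fp[c] + fn[c]
        if union == 0:
            continue  # assumption: classes absent from both maps are skipped
        ious[c] = tp[c] / union
        f1s[c] = 2 * tp[c] / (2 * tp[c] + fp[c] + fn[c])
    return ious, f1s


def mean(values):
    """Arithmetic mean of an iterable; NaN if it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else float("nan")


# Toy example with three classes and six pixels (made-up labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
ious, f1s = per_class_iou_f1(pred, target, num_classes=3)
miou, mean_f1 = mean(ious.values()), mean(f1s.values())
```

In practice the label maps would be flattened image arrays, and the counts would usually be accumulated over the whole test set before dividing.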

Abstract

Accurate face parsing under extreme viewing angles remains a significant challenge due to limited labeled data in such poses. Manual annotation is costly and often impractical at scale. We propose a novel label refinement pipeline that leverages 3D Gaussian Splatting (3DGS) to generate accurate segmentation masks from noisy multiview predictions. By jointly fitting two 3DGS models, one to RGB images and one to their initial segmentation maps, our method enforces multiview consistency through shared geometry, enabling the synthesis of pose-diverse training data with only minimal post-processing. Fine-tuning a face parsing model on this refined dataset significantly improves accuracy on challenging head poses, while maintaining strong performance on standard views. Extensive experiments, including human evaluations, demonstrate that our approach achieves superior results compared to state-of-the-art methods, despite requiring no ground-truth 3D annotations and using only a small set of initial images. Our method offers a scalable and effective solution for improving face parsing robustness in real-world settings.
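One way to see why shared geometry enforces multiview consistency is through the standard front-to-back alpha compositing used in 3DGS rendering: the blending weights depend only on geometry and opacity, so the same weights can composite either the RGB colors or the segmentation colors. The sketch below illustrates this for a single pixel. The per-Gaussian alphas and colors are hypothetical numbers, and the exact set of parameters the authors share between their two models is not specified in this summary.

```python
def blend_weights(alphas):
    """Front-to-back weights w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(a * transmittance)
        transmittance *= 1.0 - a
    return weights


def composite(weights, colors):
    """Weighted sum of per-Gaussian colors, channel by channel."""
    channels = len(colors[0])
    return tuple(
        sum(w * c[k] for w, c in zip(weights, colors)) for k in range(channels)
    )


# Hypothetical depth-sorted Gaussians covering one pixel.
alphas = [0.5, 0.5, 1.0]                      # from shared geometry/opacity
rgb_colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
seg_colors = [(1.0, 1.0, 0.0), (1.0, 1.0, 0.0), (0.0, 0.0, 0.0)]

weights = blend_weights(alphas)               # computed once from geometry
rgb_pixel = composite(weights, rgb_colors)    # appearance branch
seg_pixel = composite(weights, seg_colors)    # segmentation branch
```

Because both branches reuse `weights`, a Gaussian that is well placed by the RGB views contributes its label to every view in which it is visible, which is the intuition behind refining noisy per-view predictions.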

Paper Structure

This paper contains 9 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of our two-stage label refinement pipeline. We use multiview images and coarse baseline predictions to fit dual 3DGS models with shared geometry, enabling multiview-consistent synthesis of dense segmentation renderings. These are clustered and manually refined to produce training labels used for fine-tuning the baseline model.
  • Figure 2: Face parsing on extreme view images (bottom) using the baseline model (middle) and our automatic 3DGS-based refinement (top). Our method produces cleaner, more consistent segmentations without manual supervision.
  • Figure 3: Face parsing on held-out test images (bottom) using our baseline model before (middle) and after fine-tuning on the auxiliary dataset using our method (top).
  • Figure 4: Face parsing results on frontal views, showing that our fine-tuned model maintains strong performance on standard poses.
  • Figure 5: Face parsing results on out-of-distribution images (leftmost column) using various models (columns left to right): RoI Tanh-polar Transformer [lin2021roi], FaceXFormer [narayan2024facexformer], SegFormer [xie2021segformer], BiSeNet [yu2018bisenet] (baseline), and BiSeNet fine-tuned with our 3DGS-based auxiliary dataset. Despite using only 77 images, our method generalizes well to novel subjects and poses using minimally refined labels.
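
The caption of Figure 1 mentions that the rendered segmentation images are clustered before manual refinement. The clustering algorithm is not described in this summary, so the sketch below substitutes a simple stand-in: each rendered pixel color is assigned to the nearest color in a class palette, which turns a blended rendering back into discrete labels. The palette, the helper names, and the toy image are all hypothetical.

```python
def nearest_label(pixel, palette):
    """Index of the palette color closest to `pixel` (squared Euclidean)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((p - q) ** 2 for p, q in zip(pixel, palette[i])),
    )


def quantize(image, palette):
    """Map an image (rows of RGB tuples) to a map of integer class ids."""
    return [[nearest_label(px, palette) for px in row] for row in image]


# Hypothetical 3-class palette: background, skin, hair.
palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0)]

# Toy 2x2 rendering with slightly blended colors.
rendered = [
    [(10, 5, 0), (240, 20, 10)],
    [(30, 200, 40), (0, 0, 0)],
]
labels = quantize(rendered, palette)
```

A real pipeline would operate on full-resolution arrays and might use a learned or data-driven clustering instead of a fixed palette.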