SegFace: Face Segmentation of Long-Tail Classes

Kartik Narayan, Vibashan VS, Vishal M. Patel

TL;DR

Face parsing increasingly needs strong performance on long-tail classes such as accessories. SegFace introduces a lightweight transformer decoder with learnable class-specific tokens, coupled with multi-scale feature extraction and MLP fusion, to model each facial class independently while leveraging shared backbone features. It achieves state-of-the-art mean F1 on LaPa (93.03) and CelebAMask-HQ (88.96), and supports edge-friendly inference at 95.96 FPS, with large gains on long-tail categories like earrings and necklaces. The approach advances robust, real-time face parsing and highlights the potential of class-specific tokenization for addressing data imbalance in semantic segmentation.

Abstract

Face parsing refers to the semantic segmentation of human faces into key facial regions such as the eyes, nose, and hair. It serves as a prerequisite for various advanced applications, including face editing, face swapping, and facial makeup, which often require segmentation masks for classes like eyeglasses, hats, earrings, and necklaces. These infrequently occurring classes are called long-tail classes, and they are overshadowed by the more frequently occurring head classes. Existing methods, primarily CNN-based, tend to be dominated by head classes during training, resulting in suboptimal representations for long-tail classes. Previous works have largely overlooked the poor segmentation performance on long-tail classes. To address this issue, we propose SegFace, a simple and efficient approach built on a lightweight transformer-based model with learnable class-specific tokens. The transformer decoder lets each token focus on its corresponding class, thereby enabling independent modeling of each class. The proposed approach improves the performance on long-tail classes, thereby boosting overall performance. To the best of our knowledge, SegFace is the first work to employ transformer models for face parsing. Moreover, our approach can be adapted for low-compute edge devices, achieving 95.96 FPS. We conduct extensive experiments demonstrating that SegFace significantly outperforms previous state-of-the-art models, achieving a mean F1 score of 88.96 (+2.82) on the CelebAMask-HQ dataset and 93.03 (+0.65) on the LaPa dataset. Code: https://github.com/Kartik-3004/SegFace
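
The abstract reports results as a mean F1 score, a metric that averages per-class scores and so gives a rare class the same weight as a frequent one. The sketch below illustrates that property on a toy label map. It is not the paper's evaluation code: the helper name `per_class_f1`, the per-image aggregation, and the handling of absent classes are all assumptions made for illustration.

```python
import numpy as np

def per_class_f1(pred, target, num_classes):
    """Pixel-level F1 per class; NaN where a class is absent from both maps.

    Hypothetical helper for illustration only, not the paper's protocol.
    """
    scores = np.full(num_classes, np.nan)
    for c in range(num_classes):
        p = pred == c
        t = target == c
        denom = p.sum() + t.sum()           # 2*TP + FP + FN
        if denom > 0:
            scores[c] = 2.0 * np.sum(p & t) / denom
    return scores

# Toy 2x3 label maps: class 2 plays the role of a rare (long-tail) class
# that the prediction misses entirely.
target = np.array([[0, 0, 1],
                   [0, 2, 1]])
pred = np.array([[0, 0, 1],
                 [0, 0, 1]])

scores = per_class_f1(pred, target, num_classes=3)
mean_f1 = np.nanmean(scores)                # unweighted average over classes
pixel_accuracy = np.mean(pred == target)
```

In this toy case five of six pixels are correct, yet the missed rare class scores an F1 of 0 and pulls the unweighted mean well below the pixel accuracy. This is why improvements on long-tail classes can move a class-averaged metric noticeably.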

Paper Structure

This paper contains 17 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The proposed SegFace leverages a lightweight transformer decoder with learnable class-specific tokens. The association of each class with a token enables the independent modeling of each class, which boosts the segmentation performance of long-tail classes that typically underperform in existing methods. The blue line represents the probability of a class being present in a randomly selected image from the CelebAMask-HQ train set. SegFace provides a significant boost in the segmentation performance of long-tail classes ($+7.9$, $+21.2$), thereby establishing a new state-of-the-art in face parsing performance.
  • Figure 2: The proposed architecture, SegFace, addresses face segmentation by enhancing the performance on long-tail classes through a transformer-based approach. Specifically, multi-scale features are first extracted from an image encoder and then fused using an MLP fusion module to form face tokens. These tokens, along with class-specific tokens, undergo self-attention, face-to-token, and token-to-face cross-attention operations, refining both class and face tokens to enhance class-specific features. Finally, the upscaled face tokens and learned class tokens are combined to produce segmentation maps for each facial region.
  • Figure 3: The qualitative comparison highlights the superior performance of our method, SegFace, compared to DML-CSR. In (a), SegFace effectively segments both long-tail classes like earrings and necklaces as well as head classes such as hair and neck. In (b), it also excels in challenging scenarios involving multiple faces, human-resembling features, poor lighting, and occlusion, where DML-CSR struggles.
  • Figure 4: (a) Class-specific tokens segment their corresponding classes, showcasing the independent modeling of each class. (b) Comparison of noisy ground truth with the prediction from SegFace.
  • Figure 5: Additional qualitative comparison of our proposed method, SegFace, with DML-CSR on the (a) CelebAMask-HQ and (b) LaPa datasets.
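
The Figure 2 caption describes class tokens and face tokens refining each other through attention and then being combined into one segmentation map per class. A minimal NumPy sketch of that data flow follows. It is a simplification under stated assumptions, not the authors' implementation: it uses a single attention head with no learned projections, omits the self-attention and upscaling steps, and the names `cross_attention` and `decode_masks` as well as the sizes `H`, `W`, `D`, `K` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention with a residual update.

    Assumption: no learned query/key/value projections, for brevity.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return queries + weights @ keys_values

def decode_masks(face_tokens, class_tokens, height, width):
    """Hypothetical decoder step: refine both token sets, then combine them."""
    # Class tokens attend to face tokens to gather class-specific evidence.
    class_tokens = cross_attention(class_tokens, face_tokens)
    # Face tokens attend to the refined class tokens.
    face_tokens = cross_attention(face_tokens, class_tokens)
    # One dot product per (class token, spatial position) gives one map per class.
    logits = class_tokens @ face_tokens.T      # shape (K, H*W)
    return logits.reshape(-1, height, width)   # shape (K, H, W)

rng = np.random.default_rng(0)
H, W, D, K = 8, 8, 16, 5                       # illustrative sizes only
face_tokens = rng.normal(size=(H * W, D))      # stand-in for fused encoder features
class_tokens = rng.normal(size=(K, D))         # stand-in for learnable class tokens

logits = decode_masks(face_tokens, class_tokens, H, W)
labels = logits.argmax(axis=0)                 # per-pixel class index, shape (H, W)
```

Because each output map comes from its own token, a rare class has a dedicated query rather than competing for capacity inside a shared classification head. That is the intuition behind the independent modeling the captions refer to.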