EasyPortrait -- Face Parsing and Portrait Segmentation Dataset

Karina Kvanchiani, Elizaveta Petrova, Karen Efremyan, Alexander Sautin, Alexander Kapitanov

TL;DR

EasyPortrait is a new dataset of 40,000 primarily indoor photos that replicate video meeting scenarios, collected from 13,705 unique users and annotated with fine-grained segmentation masks separated into 9 classes. In cross-dataset evaluation, it showed the best domain generalization ability among portrait segmentation datasets.

Abstract

Video conferencing apps have recently gained functionality through computer vision-based features such as real-time background removal and face beautification. Existing portrait segmentation and face parsing datasets offer limited variability in head poses, ethnicity, scenes, and occlusions specific to video conferencing, which motivated us to create a new dataset, EasyPortrait, that addresses both tasks simultaneously. It contains 40,000 primarily indoor photos replicating video meeting scenarios with 13,705 unique users and fine-grained segmentation masks separated into 9 classes. Inappropriate annotation masks in other datasets led us to revise the annotator guidelines, so that EasyPortrait supports cases such as teeth whitening and skin smoothing. We also propose a pipeline for data mining and high-quality mask annotation via crowdsourcing. In the ablation experiments, we show the importance of data quantity and head pose diversity in our dataset for effective model learning. The cross-dataset evaluation experiments confirmed the best domain generalization ability among portrait segmentation datasets. Moreover, we demonstrate that segmentation models can be trained on EasyPortrait without extra training tricks. The proposed dataset and trained models are publicly available.

Paper Structure

This paper contains 14 sections, 11 figures, and 4 tables.

Figures (11)

  • Figure 1: The face parsing and portrait segmentation annotation examples from the EasyPortrait dataset.
  • Figure 2: Example of the data collection pipeline. The image was annotated with individual pair classes by 5 crowd workers. The resulting masks are averaged with the expert-verified one and merged to obtain the final segmentation mask.
  • Figure 3: Image resolution, brightness, subjects, and class separability analysis. a) image resolution distribution: samples are overlaid with equal transparency, so density reveals quantity; b), c), d) image distribution by subjects in the train, validation, and test sets, respectively (each bar represents the count of images recorded by a particular subject group); e) distribution of subjects' countries; f) subjects' devices: only smartphones, personal computers, and tablets were used for recording; g) brightness distribution; h) mask area distribution.
  • Figure 4: Visualization of the impact of dataset characteristics: a) sample amount and b) head pose diversity (both for face parsing), and c) sample amount for the portrait segmentation task. Solid lines correspond to models trained and tested on the EasyPortrait dataset, whereas the dotted line corresponds to the model pretrained on EasyPortrait and tested on other datasets (see the legend for details). We evaluated all the datasets discussed in the cross-dataset evaluation section; however, we did not create visualizations for datasets without significant metric changes. Note that the plots have different scales.
  • Figure 5: Visual comparison of existing portrait segmentation datasets. One can notice high-frequency details (e.g., hair) in the segmentation masks of samples from our dataset. The AiSeg dataset is not considered since it provides only the extracted foreground images without corresponding annotation masks.
  • ...and 6 more figures
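
The Figure 2 caption describes averaging several annotators' masks and merging them into a final segmentation mask. The sketch below illustrates one way such a step could look. It is not the authors' implementation. The function names `average_masks` and `merge_class_masks`, the 0.5 threshold, the treatment of the expert-verified mask as one more entry in the averaged list, and the rule that later classes overwrite earlier ones are all assumptions made for illustration.

```python
from typing import List

# Binary mask: 1 = pixel belongs to the class, 0 = it does not.
Mask = List[List[int]]


def average_masks(masks: List[Mask], threshold: float = 0.5) -> Mask:
    """Average binary masks pixel-wise and threshold back to a binary mask.

    Assumption: a pixel is kept when more than `threshold` of the masks
    mark it; the source does not state the actual rule.
    """
    height, width = len(masks[0]), len(masks[0][0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            mean = sum(m[y][x] for m in masks) / len(masks)
            result[y][x] = 1 if mean > threshold else 0
    return result


def merge_class_masks(class_masks: List[Mask]) -> Mask:
    """Merge per-class binary masks into a single label map.

    Label 0 is background; the i-th mask in the list gets label i + 1.
    Assumption: where classes overlap, the later class wins.
    """
    height, width = len(class_masks[0]), len(class_masks[0][0])
    labels = [[0] * width for _ in range(height)]
    for label, mask in enumerate(class_masks, start=1):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    labels[y][x] = label
    return labels


# Hypothetical 2x2 example: three masks per class (e.g. two crowd workers
# plus one expert-verified mask), for two classes.
class_a = average_masks([[[1, 1], [0, 0]], [[1, 0], [0, 0]], [[1, 1], [0, 1]]])
class_b = average_masks([[[0, 1], [0, 1]], [[0, 1], [1, 1]], [[0, 0], [0, 1]]])
final_mask = merge_class_masks([class_a, class_b])
print(final_mask)
```

Thresholding the average at 0.5 amounts to a pixel-wise majority vote, which suppresses individual annotator errors. A real pipeline would likely work on arrays rather than nested lists and might weight the expert-verified mask differently.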