
DiffProb: Data Pruning for Face Recognition

Eduarda Caldeira, Jan Niklas Kolf, Naser Damer, Fadi Boutros

TL;DR

This paper introduces DiffProb, a data pruning method for face recognition (FR) that removes redundant samples within each identity. It compares the samples' $p(x_i)$ values, the predicted probability of the ground-truth class, against a tunable threshold $t$, while enforcing a minimum per-identity sample count. An auxiliary cleaning mechanism additionally removes mislabeled or label-flipped samples, improving data quality. At pruning ratios of up to 50% on CASIA-WebFace, and on benchmarks including LFW, CFP-FP, AgeDB-30, CALFW, CPLFW, and IJB-C, DiffProb maintains or improves verification accuracy and generalizes across losses (CosFace, AdaFace, CurricularFace) and architectures (ResNet-50, ResNet-34). The approach reduces training cost and data volume and improves robustness to labeling noise; identity-aware pruning for privacy-conscious training is left to future work. The authors present DiffProb as the first data pruning method designed specifically for FR.

Abstract

Face recognition models have made substantial progress due to advances in deep learning and the availability of large-scale datasets. However, reliance on massive annotated datasets raises challenges of computational training cost and data storage, as well as potential privacy concerns about managing large face datasets. This paper presents DiffProb, the first data pruning approach for face recognition. DiffProb assesses the prediction probabilities of training samples within each identity and prunes those with identical or close prediction probability values, as they likely reinforce the same decision boundaries and thus contribute little new information. We further enhance this process with an auxiliary cleaning mechanism that eliminates mislabeled and label-flipped samples, boosting data quality with minimal loss. Extensive experiments on CASIA-WebFace with different pruning ratios and multiple benchmarks, including LFW, CFP-FP, and IJB-C, demonstrate that DiffProb can prune up to 50% of the dataset while maintaining or, in some settings, even improving verification accuracy. Additionally, we demonstrate DiffProb's robustness across different architectures and loss functions. Our method significantly reduces training cost and data volume, enabling efficient face recognition training and reducing the reliance on massive datasets and their demanding management.
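
The pruning idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `diffprob_prune`, the rule of comparing each sample with the last kept sample, and the way pruned samples are restored to reach the minimum count are all assumptions made for illustration. Only the ingredients ($p(x_i)$ per sample, grouping by identity, a threshold $t$, and a minimum per-identity sample count) come from the text.

```python
from collections import defaultdict


def diffprob_prune(probs, labels, t, min_per_identity=1):
    """Return the sorted indices of the samples to keep (illustrative sketch).

    probs[i]         : p(x_i), predicted probability of the ground-truth class
    labels[i]        : identity label of sample i
    t                : a sample is pruned if its p(x_i) is within t of the
                       last kept sample of the same identity (assumed rule)
    min_per_identity : floor on the number of samples kept per identity
    """
    by_identity = defaultdict(list)
    for idx, label in enumerate(labels):
        by_identity[label].append(idx)

    keep = []
    for indices in by_identity.values():
        # Process each identity in ascending order of p(x_i).
        ordered = sorted(indices, key=lambda i: probs[i])
        kept, pruned = [ordered[0]], []
        for i in ordered[1:]:
            if probs[i] - probs[kept[-1]] < t:
                pruned.append(i)  # near-duplicate probability: redundant
            else:
                kept.append(i)
        # Assumed floor handling: restore pruned samples until the minimum
        # per-identity count is reached (or nothing is left to restore).
        while len(kept) < min_per_identity and pruned:
            kept.append(pruned.pop(0))
        keep.extend(kept)
    return sorted(keep)


if __name__ == "__main__":
    probs = [0.10, 0.10001, 0.5, 0.50002, 0.9]
    labels = ["a"] * 5
    print(diffprob_prune(probs, labels, t=0.001))
```

In this toy call, the second and fourth samples lie within `t` of an already kept sample and are dropped. Raising `t` prunes more aggressively, while `min_per_identity` protects identities with few samples.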

Paper Structure

This paper contains 17 sections, 5 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Histograms of the genuine and impostor score distributions of CASIA-WebFace DBLP:journals/corr/YiLLL14a when evaluated by an FR model trained on MS1MV2 DBLP:conf/cvpr/DengGXZ19. The genuine score distribution (green) has a left tail of comparison scores that overlaps heavily with the impostor distribution, highlighting the presence of mislabeled and label-flipped samples in the CASIA-WebFace dataset.
  • Figure 2: Visual representation of two distinct identities of the CASIA-WebFace dataset and their processing by our DiffProb pruning method. Translucent images correspond to the samples pruned when $t=0.00003$, which leads to pruning 25% of the dataset samples across all identities. The value below each sample $x_i$ is its $p(x_i)$, truncated to four decimal places. Note that samples showing the same truncated value have different $p(x_i)$ values at full precision, and that the samples are ordered from left to right in ascending order of their full-precision $p(x_i)$. This explains why some samples within the same range are pruned and others are not. It can also be observed that hard samples, such as the ones in the left part of the figure, are not the first to be pruned by our DiffProb method, which is beneficial as they can contribute to the FR model's learning process.
  • Figure 3: A set of samples categorized as belonging to the same identity on the CASIA-WebFace dataset DBLP:journals/corr/YiLLL14a. The red box highlights the three samples considered as mislabeled or label-flipped by our auxiliary data cleaning mechanism. The green box contains examples of samples that were not removed by our data cleaning approach.
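
The genuine and impostor score distributions mentioned for Figure 1 can be computed along the following lines. This is a hedged sketch, not the authors' evaluation code: cosine similarity as the comparison score, the helper names `cosine` and `genuine_impostor_scores`, and the toy embeddings are assumptions. A genuine score compares two samples with the same identity label, and an impostor score compares samples with different labels.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def genuine_impostor_scores(embeddings, labels):
    """Split all pairwise comparison scores by whether the labels match."""
    genuine, impostor = [], []
    for i, j in combinations(range(len(labels)), 2):
        score = cosine(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            genuine.append(score)   # same identity label
        else:
            impostor.append(score)  # different identity labels
    return genuine, impostor


if __name__ == "__main__":
    # Toy 2-D embeddings; real FR embeddings are high-dimensional.
    embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    labels = [0, 0, 1]
    genuine, impostor = genuine_impostor_scores(embeddings, labels)
    print(genuine, impostor)
```

Genuine scores that fall in the range typical of impostor scores are the kind of overlap the Figure 1 caption attributes to mislabeled or label-flipped samples.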