DiffProb: Data Pruning for Face Recognition
Eduarda Caldeira, Jan Niklas Kolf, Naser Damer, Fadi Boutros
TL;DR
This paper introduces DiffProb, a data pruning method for face recognition (FR) that removes redundant samples within each identity. It compares $p(x_i)$, the predicted probability of the ground-truth class for sample $x_i$, across the samples of an identity and prunes those whose values are identical or close, as controlled by a tunable threshold $t$, while enforcing a minimum per-identity sample count. An auxiliary cleaning mechanism further removes mislabeled or label-flipped samples, improving data quality and FR performance. On CASIA-WebFace with pruning ratios of up to 50%, evaluated on LFW, CFP-FP, AgeDB-30, CALFW, CPLFW, and IJB-C, DiffProb maintains or, in some settings, improves verification accuracy, and it generalizes across losses (CosFace, AdaFace, CurricularFace) and architectures (ResNet-50, ResNet-34). The approach reduces training cost and data volume and improves robustness to labeling noise; privacy-conscious training through identity-aware pruning is left to future work. DiffProb is presented as the first data pruning approach designed specifically for face recognition.
Abstract
Face recognition models have made substantial progress due to advances in deep learning and the availability of large-scale datasets. However, reliance on massive annotated datasets introduces challenges related to the computational cost of training and to data storage, as well as potential privacy concerns about managing large face datasets. This paper presents DiffProb, the first data pruning approach for face recognition. DiffProb assesses the prediction probabilities of training samples within each identity and prunes those with identical or close prediction probability values, as they are likely to reinforce the same decision boundaries and thus contribute little new information. We further enhance this process with an auxiliary cleaning mechanism that eliminates mislabeled and label-flipped samples, boosting data quality with minimal loss. Extensive experiments on CASIA-WebFace with different pruning ratios and multiple benchmarks, including LFW, CFP-FP, and IJB-C, demonstrate that DiffProb can prune up to 50% of the dataset while maintaining or even, in some settings, improving verification accuracy. Additionally, we demonstrate DiffProb's robustness across different architectures and loss functions. Our method significantly reduces training cost and data volume, enabling efficient face recognition training and reducing the reliance on massive datasets and their demanding management.
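
The pruning idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `prune_by_probability`, the choice to sort each identity's samples and compare each one against the last kept sample, and the rule for back-filling identities that fall below the minimum count are all assumptions made for illustration. Only the ingredients themselves (the ground-truth-class probability $p(x_i)$, a threshold $t$, and a minimum per-identity sample count) come from the text.

```python
from collections import defaultdict


def prune_by_probability(probs, labels, t, min_samples):
    """Return the sorted indices of the samples kept after pruning.

    probs[i]    -- predicted probability of the ground-truth class, p(x_i)
    labels[i]   -- identity label of sample i
    t           -- probability-difference threshold (assumed usage)
    min_samples -- minimum number of samples kept per identity
    """
    # Group (probability, index) pairs by identity.
    by_identity = defaultdict(list)
    for idx, (p, y) in enumerate(zip(probs, labels)):
        by_identity[y].append((p, idx))

    kept = []
    for items in by_identity.values():
        items.sort()  # ascending probability, ties broken by index
        keep, drop = [], []
        last_p = None
        for p, idx in items:
            # Assumption: a sample is redundant when its probability lies
            # within t of the most recently kept sample of this identity.
            if last_p is None or p - last_p > t:
                keep.append(idx)
                last_p = p
            else:
                drop.append(idx)
        # Assumption: if too few samples survive, restore pruned ones
        # (lowest probability first) until the minimum is reached.
        shortfall = min_samples - len(keep)
        if shortfall > 0:
            keep.extend(drop[:shortfall])
        kept.extend(keep)
    return sorted(kept)


probs = [0.90, 0.91, 0.95, 0.50, 0.505, 0.51]
labels = [0, 0, 0, 1, 1, 1]
print(prune_by_probability(probs, labels, t=0.02, min_samples=2))
```

In this toy input, identity 0 loses the sample whose probability is within 0.02 of an already kept one, while identity 1 would be reduced to a single sample and is therefore topped up to the minimum of two. A larger $t$ prunes more aggressively, which is how a threshold of this kind could be tuned toward a target pruning ratio.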
