AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition

Fadi Boutros, Vitomir Štruc, Naser Damer

TL;DR

AdaDistill tackles the challenge of distilling knowledge to compact face-recognition models by embedding adaptive knowledge transfer into a margin-penalty softmax loss. It uses an exponential-moving-average mechanism to progressively refine teacher class centers and adjusts the distillation focus from sample-to-sample to sample-to-center as the student learns, without extra hyper-parameters. The approach yields strong improvements across diverse benchmarks (IJB-C/B, ICCV2021-MFR, MegaFace) and outperforms several SOTA KD methods while simplifying training compared to multi-phase distillation frameworks. This has practical implications for deploying efficient yet accurate FR systems on edge devices.

Abstract

Knowledge distillation (KD) aims at improving the performance of a compact student model by distilling the knowledge from a high-performing teacher model. In this paper, we present an adaptive KD approach, namely AdaDistill, for deep face recognition. The proposed AdaDistill embeds the KD concept into the softmax loss by training the student using a margin penalty softmax loss with distilled class centers from the teacher. Being aware of the relatively low capacity of the compact student model, we propose to distill less complex knowledge at an early stage of training and more complex knowledge at a later stage of training. This relative adjustment of the distilled knowledge is controlled by the progression of the learning capability of the student over the training iterations, without the need to tune any hyper-parameters. Extensive experiments and ablation studies show that AdaDistill can enhance the discriminative learning capability of the student and demonstrate superiority over various state-of-the-art competitors on several challenging benchmarks, such as IJB-B, IJB-C, and ICCV2021-MFR.
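
As a rough illustration of what "a margin penalty softmax loss with distilled class centers from the teacher" can look like, the following NumPy sketch computes an ArcFace-style additive angular margin loss in which the class centers are supplied from outside instead of being learned by the student. It is a sketch under stated assumptions, not the authors' implementation: the function names, the scale `s=64.0`, and the margin `m=0.5` (the usual ArcFace defaults) are chosen here for illustration, and the handling of angles beyond $\pi$ is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def margin_softmax_with_centers(student_feats, centers, labels, s=64.0, m=0.5):
    """ArcFace-style loss against externally supplied class centers.

    student_feats: (B, D) student embeddings f^s
    centers:       (C, D) class centers, assumed to come from the teacher
    labels:        (B,)   integer class labels y_i
    s, m:          scale and angular margin (illustrative defaults)
    """
    f = l2_normalize(student_feats)
    w = l2_normalize(centers)
    cos = np.clip(f @ w.T, -1.0, 1.0)           # cosine to every center, (B, C)
    rows = np.arange(len(labels))
    logits = s * cos
    # Add the angular margin only on the target-class angle.
    theta_y = np.arccos(cos[rows, labels])
    logits[rows, labels] = s * np.cos(theta_y + m)
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())

# Toy usage: four orthogonal centers, three samples.
toy_centers = np.eye(4)
toy_labels = np.array([0, 1, 2])
loss_aligned = margin_softmax_with_centers(toy_centers[toy_labels], toy_centers, toy_labels)
loss_misaligned = margin_softmax_with_centers(toy_centers[(toy_labels + 1) % 4], toy_centers, toy_labels)
```

In this toy setup, embeddings that coincide with their own class center give a much smaller loss than embeddings that coincide with a different center, which is the behavior the margin penalty is meant to enforce.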

Paper Structure (16 sections, 8 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Overview of our adaptive KD approach. At an early stage of training, $f^s_i$ (from the student) is pushed to be close to the counterpart $f^t_i$ obtained from the teacher and far from other $f^t_j$ of different classes. As training goes on, $f^s_i$ is pushed to be close to its class center, $w_{y_i}^t$ (distilled from the teacher), and far from all other class centers, $w_{j}^t$.
  • Figure 2: Overview of AdaDistill. A batch of samples $x$ from the $k$-th training iteration is passed to $T$ and $S$ to output two sets of feature representations, $f^t$ and $f^s$, respectively. Then, the class centers $w^{(k)}$ are calculated (Eq. \ref{eq:w}) based on the importance of hard samples and the learning capability of the student (Eq. \ref{eq:w_alpha}). Finally, the loss is calculated (Eq. \ref{eq:arcdistill}) and the weights of $S$ are updated.
  • Figure 3: Sample-to-sample and sample-to-center similarity score distributions. Figs. \ref{fig:sample_sample} and \ref{fig:sample_center} show examples of matching samples with other samples of the same identity and of matching samples with their class center, respectively. Fig. \ref{fig:scores_distribution} shows that matching samples with samples of the same identity yields lower similarity scores than matching samples with their class centers.
  • Figure 4: Different distillation approaches using a fixed class center (Fig. \ref{fig:arcdistill}) and an adaptive class center (Figs. \ref{fig:adadistill_earlystage} and \ref{fig:adadistill_latterstage}). In ArcDistill (Fig. \ref{fig:arcdistill}), $f_i^s$ is pushed, with a margin penalty, toward the fixed class center $w_{y_i}^t$. Figs. \ref{fig:adadistill_earlystage} and \ref{fig:adadistill_latterstage} illustrate the adaptive estimation of the class center $w_{y_i}^t$ using an exponential moving average (Eq. \ref{eq:w}). At an early stage of training (Fig. \ref{fig:adadistill_earlystage}), the $\alpha$ value (positive cosine similarity between $f_i^s$ and $f_i^t$) is small and the class center is close to $f_i^t$. As training progresses (Fig. \ref{fig:adadistill_latterstage}), $\alpha$ increases and the new class center is estimated to be close to the average of all $f_i^t$ of class $y_i$.
  • Figure 5: Fig. \ref{fig:alpha_values} (left): Average $\alpha$ and $\alpha'$ values over the training iterations, as defined in Eqs. \ref{eq:alpha} and \ref{eq:w_alpha}, respectively. Fig. \ref{fig:loss} (right): Loss values and average verification accuracies over training iterations of students trained with different loss functions. Losses are drawn as solid lines and average accuracies as dashed lines. The average accuracies are calculated on five benchmarks (LFW, CFP-FP, AgeDB-30, CA-LFW, and CP-LFW), described in Section \ref{sec:dataset}. AdaDistill ($\alpha$) and AdaDistill ($\alpha'$) facilitated better convergence and achieved higher accuracies than ArcDistill and standalone ArcFace.
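
The adaptive class-center behavior described in the Figure 4 caption can be sketched as a per-sample exponential-moving-average update. The sketch assumes the update has the form $w^{(k)} = \alpha\, w^{(k-1)} + (1-\alpha)\, f_i^t$, with $\alpha$ the cosine similarity between $f_i^s$ and $f_i^t$ clipped at zero. That form is inferred from the caption (small $\alpha$ keeps the center near $f_i^t$, large $\alpha$ keeps the accumulated average) and is not copied from the paper's equations; the function names are hypothetical, and the hard-sample weighting behind $\alpha'$ is not modeled.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / max(np.linalg.norm(a) * np.linalg.norm(b), eps))

def update_center(prev_center, f_s, f_t):
    """One assumed EMA step for the class center of a single sample.

    prev_center: center w^(k-1) of the sample's class
    f_s, f_t:    student and teacher embeddings of the same sample
    Returns the new center and the alpha that was used.
    """
    # "Positive cosine similarity": negative values are clipped to zero (assumption).
    alpha = max(cosine(f_s, f_t), 0.0)
    new_center = alpha * prev_center + (1.0 - alpha) * f_t
    return new_center, alpha

prev = np.array([0.0, 1.0])
f_t = np.array([1.0, 0.0])

# Early training: student far from teacher, so the center follows f_t.
early_center, early_alpha = update_center(prev, np.array([0.0, 1.0]), f_t)
# Late training: student matches teacher, so the accumulated center is kept.
late_center, late_alpha = update_center(prev, np.array([1.0, 0.0]), f_t)
```

Because $\alpha$ is computed from the student's own agreement with the teacher, the shift from sample-level to center-level targets needs no separately tuned schedule in this sketch, which matches the hyper-parameter-free claim in the abstract.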