Periocular Embedding Learning with Consistent Knowledge Distillation from Face

Yoon Gyo Jung, Jaewoo Park, Cheng Yaw Low, Jacky Chen Long Chai, Leslie Ching Ow Tiong, Andrew Beng Jin Teoh

TL;DR

Periocular recognition suffers from limited discriminative cues in occluded or masked settings. The authors introduce Consistent Knowledge Distillation (CKD), a single-stage training framework that transfers global inter-class relationships learned from face data to a periocular network through prediction- and feature-layer consistency, with a single hyperparameter $\tau$. They show CKD is equivalent to learned-label smoothing augmented by a sparsity-oriented regularizer, enabling robust relational embedding transfer and improved calibration. Across six standard periocular benchmarks, CKD achieves state-of-the-art performance and demonstrates strong generalization to low-resolution faces and general face verification protocols, supported by theoretical and empirical analyses of the regularizer and cross-domain consistency.

Abstract

The periocular biometric, which uses the peripheral area of the eye, is a collaborative alternative to the face, especially when the face is occluded or masked. In practice, however, the periocular region alone captures the least salient facial features and therefore lacks discriminative information, particularly in wild environments. To address this problem, we transfer discriminative information from the face to support the training of a periocular network by using knowledge distillation. Specifically, we leverage face images for periocular embedding learning, but only periocular images are used for identification or verification. To enhance periocular embeddings with the face effectively, we propose Consistent Knowledge Distillation (CKD), which imposes consistency between the face and periocular networks across prediction and feature layers. We find that imposing consistency at the prediction layer enables (1) extraction of global discriminative relationship information from face images and (2) effective transfer of this information from the face network to the periocular network. In particular, consistency regularizes the prediction units to extract and store profound inter-class relationship information of face images. (3) Feature-layer consistency, on the other hand, makes the periocular features robust against identity-irrelevant attributes. Overall, CKD empowers the periocular network alone to produce robust discriminative embeddings for periocular recognition in the wild. We theoretically and empirically validate the core principles of the distillation mechanism in CKD, discovering that CKD is equivalent to label smoothing with a novel sparsity-oriented regularizer that helps the network prediction capture the global discriminative relationship. Extensive experiments reveal that CKD achieves state-of-the-art results on standard periocular recognition benchmark datasets.

Paper Structure

This paper contains 24 sections, 6 theorems, 11 equations, 9 figures, 6 tables.

Key Result

Theorem 1

Let $\mathbf{p}_\tau$ denote the softmax probability of the periocular logit $\mathbf{z}$ divided by $\tau$, $p_{\tau,k} = e^{z_k/\tau} / \sum_i e^{z_i/\tau}$, and likewise for face, $\mathbf{p}^F_\tau$. Let $\mathbf{p}=\mathbf{p}_1$ and $\mathbf{p}^F = \mathbf{p}^F_1$ be the softmax posteriors with up to scale where $H(\mathbf{y}, \mathbf{p}) = - \sum_k y_k \log p_k$ is cross entropy, $\widetilde
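The temperature-scaled softmax defined at the start of the theorem, $p_{\tau,k} = e^{z_k/\tau} / \sum_i e^{z_i/\tau}$, can be illustrated with a short sketch. Only the formula comes from the text above; the function name `softmax_with_temperature` and the sample logits are assumptions made for illustration and are not taken from the paper.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Return p_tau with p_tau[k] = exp(z_k / tau) / sum_i exp(z_i / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-class logits, chosen only for illustration.
z = [2.0, 1.0, 0.0]
p = softmax_with_temperature(z, tau=1.0)       # p = p_1, the ordinary softmax
p_soft = softmax_with_temperature(z, tau=4.0)  # a larger tau gives a flatter distribution
```

With $\tau = 1$ the function reduces to the ordinary softmax $\mathbf{p} = \mathbf{p}_1$; a larger $\tau$ spreads probability mass over the non-target classes while preserving the ranking of the logits.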

Figures (9)

  • Figure 1: Comparison of different knowledge distillation methods. Knowledge distillation (KD) neither captures discriminative relationship information from face images nor effectively transfers it. Mutual learning (ML) only resolves information transfer. Relational knowledge distillation (RKD) can capture relationships between faces but only locally, and its transfer is ineffective. Our CKD can capture the global inter-class relationships between faces and effectively transfer that information to the periocular network. Moreover, by consistency in feature layers via shared weights and batch statistics, CKD extracts periocular features robust against identity-irrelevant attributes.
  • Figure 2: The network architecture of CKD. A paired input, consisting of a face and a periocular region, is fed forward through a shared-weights network, followed by respective projection heads. The whole network in the figure is trained, and only the colored area is used in the testing stage for periocular recognition.
  • Figure 3: Visualization of the regularizer $R(\mathbf{z})$ through the exponential of its negative, $\exp(-R(\mathbf{z})) = \sum_{k=1}^K e^{z_k} / (\sum_{k=1}^K e^{z_k/\tau})^\tau$, with $K = 2$ and $\tau = 1.25$, $2.5$, and $5$. (First row) 2-D heatmaps. (Second row) The corresponding 3-D surfaces. The plots show that the regularizer is minimized when one $z_k$ is maximized and the other $z_i$ is minimized; namely, $R(\mathbf{z})$ is minimized when the softmax posterior $\mathbf{p}$ of the logit $\mathbf{z}$ converges to one-hot. Hence, $R(\mathbf{z})$ as a regularizer prevents over-smoothing of the predictions $\mathbf{p}$. In terms of the temperature $\tau$, on the other hand, increasing $\tau$ decreases this exponential, thereby increasing the upper bound of $R(\mathbf{z})$. Thus, a large temperature $\tau$ enhances the impact of the regularizer $R(\mathbf{z})$. The theoretical analysis of $R(\mathbf{z})$ is given in Proposition \ref{thm:reg}.
  • Figure 4: The softmax outputs of different models. With cross entropy (CE), the face network learns sparse predictions unless it underfits. Label smoothing alone may over-smooth the prediction and does not necessarily capture the inter-class relationships. Our CKD, however, smooths the prediction only to a moderate degree and captures an inter-class relationship in the non-target posterior $p(k \neq y | \mathbf{x}^F)$.
  • Figure 5: Performance comparison for periocular identification (first row, CMC curves) and verification (second row, ROC curves). The black line is our proposed CKD model. (First column) Results of the ablation study in Sec. \ref{sec:exp_ablation}. (Second column) Comparison with state-of-the-art periocular recognition methods in Sec. \ref{sec:exp_comp_peri}. (Third column) Comparison with different KD methods that enhance the periocular network by self-distillation from ocular images (Sec. \ref{sec:exp_comp_kd}). (Fourth column) Comparison with different KD methods that enhance the periocular network by knowledge distillation from face images (Sec. \ref{sec:exp_comp_kd}).
  • ...and 4 more figures
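
The behavior described in the caption of Figure 3 can be checked numerically. Taking the negative logarithm of the caption's expression gives $R(\mathbf{z}) = \tau \log \sum_k e^{z_k/\tau} - \log \sum_k e^{z_k}$, which is what the sketch below evaluates. The function names and the test logits are illustrative assumptions rather than the authors' code. Under this closed form, uniform logits give $R = (\tau - 1)\log K$, which grows with $\tau$, while strongly peaked logits drive $R$ toward zero.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def regularizer(logits, tau):
    """R(z) = tau * logsumexp(z / tau) - logsumexp(z), so that
    exp(-R(z)) = sum_k e^{z_k} / (sum_k e^{z_k / tau})^tau."""
    return tau * logsumexp([z / tau for z in logits]) - logsumexp(logits)

K = 2  # same number of classes as in the figure
for tau in (1.25, 2.5, 5.0):  # same temperatures as in the figure
    r_uniform = regularizer([0.0] * K, tau)     # softmax is uniform (fully smoothed)
    r_peaked = regularizer([20.0, -20.0], tau)  # softmax is nearly one-hot
    print(f"tau={tau}: R(uniform)={r_uniform:.4f}, R(peaked)={r_peaked:.2e}")
```

For every temperature in the loop, the value at uniform logits is larger than the value at peaked logits, which matches the caption's reading that $R(\mathbf{z})$ penalizes over-smoothed predictions and that its range widens as $\tau$ increases.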

Theorems & Definitions (11)

  • Theorem 1
  • Lemma 2
  • proof : Proof of Lemma \ref{thm:ckd_smooth_lemma}
  • proof : Proof of Theorem \ref{thm:ckd_smooth}
  • Proposition 3
  • proof
  • Corollary 4
  • Proposition 5
  • proof
  • Proposition 6
  • ...and 1 more