LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition

Zhonglin Sun, Chen Feng, Ioannis Patras, Georgios Tzimiropoulos

TL;DR

This work focuses on learning facial representations that can be adapted to train effective face recognition models, particularly in the absence of labels. It also incorporates two landmark-specific augmentations that introduce more diverse landmark information to further regularize the learning.

Abstract

In this work we focus on learning facial representations that can be adapted to train effective face recognition models, particularly in the absence of labels. Firstly, compared with existing labeled face datasets, a vastly larger number of unlabeled faces exists in the real world. We explore a learning strategy for these unlabeled facial images, using self-supervised pretraining to transfer generalized face recognition performance. Moreover, motivated by a recent finding that the face saliency area is critical for face recognition, we construct augmentations in pretraining from patches localized by extracted facial landmarks rather than from randomly cropped image blocks. This enables our method, namely LAndmark-based Facial Self-supervised learning (LAFS), to learn key representations that are more critical for face recognition. We also incorporate two landmark-specific augmentations which introduce more diversity of landmark information to further regularize the learning. With the learned landmark-based facial representations, we further adapt the representation for face recognition with a regularization that mitigates variations in landmark positions. Our method achieves significant improvement over the state-of-the-art on multiple face recognition benchmarks, especially in more challenging few-shot scenarios.

Paper Structure (44 sections, 8 equations, 2 figures, 8 tables, 1 algorithm)

This paper contains 44 sections, 8 equations, 2 figures, 8 tables, 1 algorithm.

Figures (2)

  • Figure 1: Illustration of our pretraining and finetuning pipeline for face recognition. First, a landmark CNN is learnt using the Part fViT framework [Sun_2022_BMVC]. We adopt the landmark CNN to provide facial landmarks for constructing our LAFS pretraining. In this framework, the 'Teacher' processes the entire set of provided landmarks, while the 'Student' operates on subsets of these landmarks. Then, we transfer the 'Teacher' for finetuning with an additional regularization that penalizes large variations in the landmark predictions.
  • Figure 2: (A) The pipeline of our proposed LAFS framework. Two views of a facial image are first processed by the landmark CNN to provide landmark localization. Then we sample a subset of landmarks on the student branch. Following that, landmark-based augmentations are applied before the landmarks are converted into embeddings for processing by the teacher and student backbones. The representations of the two views are compared at the output of the backbones without label information. Gradients are backpropagated to the student network only, and the teacher network is updated by the exponential moving average of the student parameters. (B) Landmark augmentations. The upper part shows shuffling: the original order in which patches are sent to fViT is ①, ②, ③, ④, and this order changes after shuffling. The bottom part shows the coordinate variation under perturbation: each green point shifts to the position of the corresponding red point after the perturbation.
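
The operations named in the Figure 2 caption (student-side landmark subset sampling, shuffling, coordinate perturbation, and the exponential-moving-average teacher update) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the normalized [0, 1] coordinate convention, the uniform noise model, and all numeric values (16 landmarks, subset size 12, shift 0.02, momentum 0.996) are assumptions chosen only for illustration.

```python
import numpy as np


def sample_landmark_subset(landmarks, keep, rng):
    """Student branch: keep a random subset of `keep` landmarks (original order kept)."""
    idx = np.sort(rng.choice(len(landmarks), size=keep, replace=False))
    return landmarks[idx]


def shuffle_landmarks(landmarks, rng):
    """Shuffling augmentation: permute the order in which landmarks are fed onward."""
    return landmarks[rng.permutation(len(landmarks))]


def perturb_landmarks(landmarks, max_shift, rng):
    """Perturbation augmentation: shift each coordinate by a small random offset.

    Uniform noise and clipping to [0, 1] are assumptions of this sketch.
    """
    noise = rng.uniform(-max_shift, max_shift, size=landmarks.shape)
    return np.clip(landmarks + noise, 0.0, 1.0)


def ema_update(teacher, student, momentum):
    """Teacher update: teacher <- m * teacher + (1 - m) * student, per parameter."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Hypothetical usage: 16 landmarks with normalized (x, y) coordinates.
rng = np.random.default_rng(0)
landmarks = rng.uniform(0.0, 1.0, size=(16, 2))

# The teacher sees all landmarks; the student sees an augmented subset.
teacher_view = landmarks
student_view = sample_landmark_subset(landmarks, keep=12, rng=rng)
student_view = shuffle_landmarks(student_view, rng)
student_view = perturb_landmarks(student_view, max_shift=0.02, rng=rng)

# Parameters are stand-in arrays here; a real model would hold network weights.
teacher_params = {"w": np.ones(3)}
student_params = {"w": np.zeros(3)}
teacher_params = ema_update(teacher_params, student_params, momentum=0.996)
```

In this sketch only the student parameters would receive gradients during training, while the teacher is changed solely through `ema_update`, which mirrors the update rule stated in the caption.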