RetinaFace: Single-stage Dense Face Localisation in the Wild

Jiankang Deng, Jia Guo, Yuxiang Zhou, Jinke Yu, Irene Kotsia, Stefanos Zafeiriou

TL;DR

RetinaFace is a single-stage, pixel-wise dense face localisation framework that combines face detection, five-landmark regression, and a self-supervised dense 3D face branch under one multi-task loss. Using extra landmark annotations and a mesh-decoder-based dense regression built on graph convolutions and differentiable rendering, it achieves state-of-the-art results on the WIDER FACE hard subset (AP 91.4%) and improves face verification on IJB-C (TAR 89.59% at FAR 1e-6) when used with ArcFace. With lightweight backbones, it runs in real time on a single CPU core for VGA-resolution images. The authors also report extensive ablations showing the benefit of the landmark and dense regression branches, and they publicly release the extra annotations and code.

Abstract

Though tremendous strides have been made in uncontrolled face detection, accurate and efficient face localisation in the wild remains an open challenge. This paper presents a robust single-stage face detector, named RetinaFace, which performs pixel-wise face localisation on various scales of faces by taking advantage of joint extra-supervised and self-supervised multi-task learning. Specifically, we make contributions in the following five aspects: (1) We manually annotate five facial landmarks on the WIDER FACE dataset and observe significant improvement in hard face detection with the assistance of this extra supervision signal. (2) We further add a self-supervised mesh decoder branch for predicting pixel-wise 3D face shape information in parallel with the existing supervised branches. (3) On the WIDER FACE hard test set, RetinaFace outperforms the state-of-the-art average precision (AP) by 1.1% (achieving an AP of 91.4%). (4) On the IJB-C test set, RetinaFace enables state-of-the-art methods (ArcFace) to improve their results in face verification (TAR=89.59% for FAR=1e-6). (5) By employing light-weight backbone networks, RetinaFace can run in real time on a single CPU core for a VGA-resolution image. Extra annotations and code have been made available at: https://github.com/deepinsight/insightface/tree/master/RetinaFace.
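
The verification figure in contribution (4) is a true accept rate (TAR) reported at a fixed false accept rate (FAR). As a minimal sketch of how such a number is typically computed from pair similarity scores, assuming higher scores mean "same identity": the function name `tar_at_far` and the toy scores below are hypothetical and are not taken from the paper or the IJB-C protocol.

```python
def tar_at_far(genuine_scores, impostor_scores, far):
    """True accept rate at the lowest threshold whose false accept rate is <= far.

    A pair is accepted when its score is strictly greater than the threshold.
    """
    impostors = sorted(impostor_scores, reverse=True)
    # Number of impostor pairs that may be (wrongly) accepted at this FAR.
    allowed = int(far * len(impostors))
    # Using the `allowed`-th highest impostor score as the threshold means
    # at most `allowed` impostor scores lie strictly above it.
    threshold = impostors[allowed] if allowed < len(impostors) else float("-inf")
    accepted = sum(1 for score in genuine_scores if score > threshold)
    return accepted / len(genuine_scores)


# Hypothetical toy scores, for illustration only.
genuine = [0.9, 0.8, 0.4, 0.7]
impostor = [0.1, 0.2, 0.3, 0.5, 0.6, 0.15, 0.25, 0.35, 0.05, 0.45]
print(tar_at_far(genuine, impostor, far=0.1))  # 0.75
```

At FAR=1e-6 the threshold is only meaningful when there are at least a million impostor pairs; with fewer, `allowed` is zero and no impostor pair may be accepted at all.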

Paper Structure

This paper contains 16 sections, 3 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The proposed single-stage pixel-wise face localisation method employs extra-supervised and self-supervised multi-task learning in parallel with the existing box classification and regression branches. Each positive anchor outputs (1) a face score, (2) a face box, (3) five facial landmarks, and (4) dense 3D face vertices projected on the image plane.
  • Figure 2: An overview of the proposed single-stage dense face localisation approach. RetinaFace is designed based on the feature pyramids with independent context modules. Following the context modules, we calculate a multi-task loss for each anchor.
  • Figure 3: (a) 2D convolution is a kernel-weighted neighbour sum within the Euclidean grid receptive field. Each convolutional layer has $Kernel_H \times Kernel_W \times Channel_{in} \times Channel_{out}$ parameters. (b) Graph convolution is also a kernel-weighted neighbour sum, but the neighbour distance is calculated on the graph by counting the minimum number of edges connecting two vertices. Each convolutional layer has $K \times Channel_{in} \times Channel_{out}$ parameters, and the Chebyshev coefficients $\theta_{i,j} \in \mathbb{R}^K$ are truncated at order $K$.
  • Figure 4: We add extra annotations of five facial landmarks on faces that can be annotated (we call them "annotatable") from the WIDER FACE training and validation sets.
  • Figure 5: Precision-recall curves on the WIDER FACE validation and test subsets.
  • ...and 4 more figures
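
The Figure 3 caption compares the parameter counts of a 2D convolution and a graph convolution, and defines graph distance as the minimum number of edges between two vertices. The sketch below illustrates both ideas in plain Python. The function names, the layer sizes, and the toy graph are hypothetical choices for illustration and do not come from the paper; bias terms are ignored, as in the caption.

```python
from collections import deque


def conv2d_param_count(kernel_h, kernel_w, channels_in, channels_out):
    """Weights in one 2D convolution layer (bias terms not counted)."""
    return kernel_h * kernel_w * channels_in * channels_out


def graph_conv_param_count(k, channels_in, channels_out):
    """Weights in one graph convolution layer with K coefficients per
    input/output channel pair (bias terms not counted)."""
    return k * channels_in * channels_out


def hop_distances(adjacency, source):
    """Minimum number of edges from `source` to each reachable vertex,
    found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for neighbour in adjacency[vertex]:
            if neighbour not in dist:
                dist[neighbour] = dist[vertex] + 1
                queue.append(neighbour)
    return dist


# Hypothetical toy graph: a path 0-1-2-3 with one extra edge 1-3.
toy_graph = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}

print(conv2d_param_count(3, 3, 64, 128))   # 73728
print(graph_conv_param_count(3, 64, 128))  # 24576
print(hop_distances(toy_graph, 0))         # {0: 0, 1: 1, 2: 2, 3: 2}
```

For these assumed sizes, the graph layer with K = 3 has a third of the weights of a 3 × 3 image convolution, because its kernel is indexed by a single order K rather than by a two-dimensional window.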