Self-Supervised Facial Representation Learning with Facial Region Awareness

Zheng Gao, Ioannis Patras

TL;DR

This work makes a first attempt at a self-supervised facial representation learning framework, Facial Region Awareness (FRA), that learns consistent global and local facial representations. FRA explicitly enforces the consistency of facial regions by matching local facial representations across views; these representations are extracted with learned heatmaps that highlight the facial regions.

Abstract

Self-supervised pre-training has proven effective in learning transferable representations that benefit various visual tasks. This paper asks: can self-supervised pre-training learn general facial representations for various facial analysis tasks? Recent efforts toward this goal are limited to treating each face image as a whole, i.e., learning consistent facial representations at the image level, which overlooks the consistency of local facial representations (i.e., facial regions such as the eyes and nose). In this work, we make a first attempt at a self-supervised facial representation learning framework, Facial Region Awareness (FRA), that learns consistent global and local facial representations. Specifically, we explicitly enforce the consistency of facial regions by matching the local facial representations across views, which are extracted with learned heatmaps highlighting the facial regions. Inspired by mask prediction in supervised semantic segmentation, we obtain the heatmaps via cosine similarity between the per-pixel projection of the feature maps and facial mask embeddings. These embeddings are computed from learnable positional embeddings, which use the attention mechanism to search the whole facial image for facial regions. To learn such heatmaps, we formulate the learning of facial mask embeddings as a deep clustering problem in which the pixel features from the feature maps are assigned to the facial mask embeddings. The transfer learning results on facial classification and regression tasks show that FRA outperforms previous pre-trained models and, more importantly, that with ResNet as the unified backbone for various tasks, FRA performs comparably to, or even better than, state-of-the-art (SOTA) methods in facial analysis tasks.
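
The heatmap step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python lists where a real model would use a tensor library, and the function names, the toy values, and the softmax-weighted pooling with a temperature of 0.1 are all assumptions made for illustration. The abstract only states that heatmaps come from cosine similarity between projected pixel features and facial mask embeddings, and that local representations are extracted with those heatmaps.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def softmax(xs, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def region_heatmaps(pixels, mask_embeddings):
    """heatmaps[k][p] = cosine similarity of pixel p and mask embedding k."""
    return [[cosine(p, e) for p in pixels] for e in mask_embeddings]

def pool_local_embeddings(pixels, heatmaps, temperature=0.1):
    """Aggregate pixel features into one local embedding per heatmap.

    Assumption: each heatmap is normalized over pixels with a softmax
    and used as weights for an average of the pixel features.
    """
    dim = len(pixels[0])
    local = []
    for heat in heatmaps:
        weights = softmax(heat, temperature)
        local.append([sum(w * p[d] for w, p in zip(weights, pixels))
                      for d in range(dim)])
    return local

# Toy input: a 2x2 feature map flattened to 4 pixels with 3-d projected features.
pixels = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
# Two hypothetical facial mask embeddings (one per facial region).
mask_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

heatmaps = region_heatmaps(pixels, mask_embeddings)
local_embeddings = pool_local_embeddings(pixels, heatmaps)
```

In this toy setup the first heatmap responds to the first two pixels and the second heatmap to the last two, so each local embedding is dominated by the features of its own region. Matching these local embeddings across two augmented views is what the abstract calls enforcing the consistency of facial regions.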

Paper Structure

This paper contains 26 sections, 8 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Overview of the proposed FRA framework. $\odot$ denotes cosine similarity. For each input image $\mathbf{x}$, its augmented views $\mathbf{x}_1$ and $\mathbf{x}_2$ are passed into two network branches to produce the global embeddings $\mathbf{z}_1$ and $\mathbf{z}_2$. In addition, we produce a set of heatmaps $\mathbf{M}_1$ and $\mathbf{M}_2$ indicating the local facial regions, via the correlation between the pixel features and "facial mask embeddings" computed from a set of learnable positional embeddings. We then aggregate the feature map to obtain the local facial embeddings $\{\mathbf{z}^{m}_1\}$ and $\{\mathbf{z}^{m}_2\}$. The semantic consistency loss is applied to the global embeddings and the local facial embeddings to maximize the similarity across augmented views. To learn such heatmaps, i.e., the facial mask embeddings, we treat the facial mask embeddings as facial region clusters and propose a semantic relation loss to align the cluster assignments of each pixel feature over the facial region clusters between the online and momentum networks.
  • Figure 2: Generation of heatmaps using learnable positional embeddings as facial queries and the feature maps as keys and values.
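
The two captions above can be sketched in a few lines of plain Python. This is an illustrative approximation only: it assumes single-head scaled dot-product attention without learned projection layers for Figure 2, and for the semantic relation loss in Figure 1 it assumes a cross-entropy between softmax cluster assignments, with the momentum branch as the target. The function names, temperatures, and toy values are hypothetical and are not taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norms + 1e-12)

def cross_attention(queries, keys, values):
    """Each query attends over all pixels and returns a weighted mix of values."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs

def cluster_assignment(pixel, mask_embeddings, temperature):
    """Soft assignment of one pixel feature over the facial region clusters."""
    return softmax([cosine(pixel, e) / temperature for e in mask_embeddings])

def relation_loss(online_pixels, momentum_pixels, mask_embeddings,
                  t_online=0.1, t_momentum=0.05):
    """Mean cross-entropy between per-pixel assignments of the two branches.

    Assumption: momentum-branch assignments act as targets.
    """
    total = 0.0
    for p_on, p_mo in zip(online_pixels, momentum_pixels):
        pred = cluster_assignment(p_on, mask_embeddings, t_online)
        target = cluster_assignment(p_mo, mask_embeddings, t_momentum)
        total -= sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))
    return total / len(online_pixels)

# Flattened 2x2 feature map (keys and values) and two facial queries that
# stand in for the learnable positional embeddings.
pixel_features = [[4.0, 0.0], [3.0, 1.0], [0.0, 4.0], [1.0, 3.0]]
facial_queries = [[2.0, 0.0], [0.0, 2.0]]

mask_embeddings = cross_attention(facial_queries, pixel_features, pixel_features)

# Assignments agree when both branches see the same pixels and disagree
# when the momentum branch sees them in reversed order.
loss_aligned = relation_loss(pixel_features, pixel_features, mask_embeddings)
loss_mismatched = relation_loss(pixel_features, pixel_features[::-1], mask_embeddings)
```

Here each query pulls its mask embedding toward the pixels it matches, and the loss is small when the two branches assign a pixel to the same region cluster and large when they disagree.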