KeyPoint Relative Position Encoding for Face Recognition

Minchul Kim, Yiyang Su, Feng Liu, Anil Jain, Xiaoming Liu

TL;DR

KP-RPE is a novel method that leverages keypoints (e.g., facial landmarks) to make ViT more resilient to scale, translation, and pose variations. It improves face recognition performance on low-quality images, particularly where alignment is prone to failure.

Abstract

In this paper, we address the challenge of making ViT models more robust to unseen affine transformations. Such robustness becomes useful in various recognition tasks such as face recognition when image alignment failures occur. We propose a novel method called KP-RPE, which leverages keypoints (e.g., facial landmarks) to make ViT more resilient to scale, translation, and pose variations. We begin with the observation that Relative Position Encoding (RPE) is a good way to bring affine transform generalization to ViTs. RPE, however, can only inject the model with prior knowledge that nearby pixels are more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this principle, where the significance of pixels is not solely dictated by their proximity but also by their relative positions to specific keypoints within the image. By anchoring the significance of pixels around keypoints, the model can more effectively retain spatial relationships, even when those relationships are disrupted by affine transformations. We show the merit of KP-RPE in face and gait recognition. The experimental results demonstrate its effectiveness in improving face recognition performance from low-quality images, particularly where alignment is prone to failure. Code and pre-trained models are available.
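To make the RPE prior in the abstract concrete, the following is a minimal NumPy sketch of a distance-indexed attention offset, not the authors' implementation. It assumes a single attention head, a bias that depends only on a rounded Euclidean distance between patch positions, and hypothetical names (`quantized_distance`, `rpe_attention`, `bias_table`) that do not come from the paper.

```python
import numpy as np

def patch_coords(n):
    """Return an (n*n, 2) array of (row, col) patch positions on an n x n grid."""
    rows, cols = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=-1)

def quantized_distance(n, num_buckets):
    """Assumed d(i, j): Euclidean patch distance, rounded and clipped to a bucket index."""
    c = patch_coords(n).astype(np.float64)
    dist = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    return np.minimum(np.rint(dist).astype(np.int64), num_buckets - 1)

def rpe_attention(q, k, v, bias_table, bucket_idx):
    """Single-head attention whose logits receive an offset B_ij = bias_table[d(i, j)]."""
    dim = q.shape[-1]
    B = bias_table[bucket_idx]                      # (N, N), identical for every image
    logits = q @ k.T / np.sqrt(dim) + B
    logits -= logits.max(axis=-1, keepdims=True)    # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v, B
```

Because `B` depends only on the grid distance between query and key, it is the same for every input image. That fixed offset is the limitation KP-RPE targets by conditioning the offset on keypoint locations.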

Paper Structure (38 sections, 14 equations, 13 figures, 9 tables)

Figures (13)

  • Figure 1: Toy example illustrating how different position embeddings affect the ViT's robustness to unseen affine transforms. Abs-PE refers to the learnable Absolute Position Embedding. RPE and iRPE refer to Relative Position Embeddings adapted to ViT [huang-etal-2020-improve, wu2021rethinking]. Keypoints in MNIST are arbitrarily defined as the four corners of a box that covers a digit. Abs-PE* draws the keypoints onto the input image. KP-RPE uses the keypoints to adjust the RPE.
  • Figure 2: Illustration of RPE [shaw2018self] and the proposed KP-RPE. The blue arrow represents the learned attention offset $\mathbf{B}_{ij}$ between a query $i$ and key $j$ of attention in RPE. The query-key relationship at the same $i$ and $j$ should change as the scale or pose changes, but $\mathbf{B}_{ij}$ stays fixed in RPE. KP-RPE addresses this issue by incorporating the distance to the keypoints when calculating the learned attention offset in RPE.
  • Figure 3: Depiction of key-query combinations in an image, given a query location $i=(7,7)$ ($\star$). Distinct colors represent varying attention offset values in RPE based on the distance between $i$ and $j$. We show $\mathbf{B}_{i=(7,7),j}$ for all $j$ in the $14\times14$ grid. a) The distance function is a quantized Euclidean distance. b) Product distance proposed in iRPE accounts for direction. c) We adopt b) and allow $\mathbf{B}_{i,j}$ to vary based on keypoint locations ($\bullet$).
  • Figure 4: a) Illustration of KP-RPE. First, a mesh grid $\mathbf{M}$ and image-specific keypoints $\mathbf{P}$ are generated. Then the broadcasted difference $\mathbf{D}$ is calculated, and we linearly map $\mathbf{D}$ to $f(\mathbf{P})$. Finally, for a given $i,j$, we find $\mathbf{B}_{ij}=f(\mathbf{P})[i, d(i,j)]$, which is used to adjust the attention map in self-attention. b) The backbone contains multiple transformer blocks followed by an MLP for classification. KP-RPE is used where multi-head attention modules exist. KP-RPE is efficient because $f(\mathbf{P})$ is computed once.
  • Figure 5: Plot of verification accuracy on CFPFP [cfpfp]. On the X-axis, we interpolate the affine transformation from raw data (Detection Image) to canonical alignment (Alignment Image). Note that KP-RPE is robust to affine transformations even though all models were trained on the aligned image dataset.
  • ...and 8 more figures
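
The steps in the Figure 4 caption (mesh grid $\mathbf{M}$, keypoints $\mathbf{P}$, broadcasted difference $\mathbf{D}$, linear map to $f(\mathbf{P})$, lookup $\mathbf{B}_{ij}=f(\mathbf{P})[i, d(i,j)]$) can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the released code. The normalized coordinates, the single linear layer (`weight`, `bias`), and the rounded Euclidean bucket function `bucket_index` are assumptions, and all function names are hypothetical. The caption says only that the map is linear, and Figure 3 indicates the paper adopts the iRPE product distance rather than the Euclidean one used here.

```python
import numpy as np

def mesh_grid(n):
    """Mesh grid M: normalized (row, col) centers of an n x n patch grid, shape (n*n, 2)."""
    ticks = (np.arange(n) + 0.5) / n
    rows, cols = np.meshgrid(ticks, ticks, indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=-1)

def bucket_index(n, num_buckets):
    """Assumed d(i, j): rounded Euclidean patch distance, clipped to num_buckets - 1."""
    c = mesh_grid(n) * n
    dist = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    return np.minimum(np.rint(dist).astype(np.int64), num_buckets - 1)

def kp_rpe_bias(keypoints, n, weight, bias, num_buckets):
    """Return B with B[i, j] = f(P)[i, d(i, j)] for keypoints P of shape (K, 2)."""
    M = mesh_grid(n)                                # (N, 2)
    D = M[:, None, :] - keypoints[None, :, :]       # broadcasted difference, (N, K, 2)
    D = D.reshape(M.shape[0], -1)                   # (N, 2K)
    f_P = D @ weight + bias                         # linear map, (N, num_buckets)
    d = bucket_index(n, num_buckets)                # (N, N)
    return np.take_along_axis(f_P, d, axis=1)       # row i gathers f_P[i, d[i, :]]
```

In this sketch, setting `weight` to zero makes the offset independent of the keypoints, so it reduces to a plain distance-indexed table. With nonzero weights, moving the keypoints changes the offset for the same query-key pair. The result would be added to the attention logits before the softmax.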