Fiducial Focus Augmentation for Facial Landmark Detection

Purbayan Kar, Vishal Chudasama, Naoyuki Onoe, Pankaj Wasnik, Vineeth Balasubramanian

TL;DR

This work tackles facial landmark detection under challenging conditions by introducing Fiducial Focus Augmentation (FiFA), a patch-based augmentation that places black squares around landmark fiducials to embed facial structure as an inductive bias. It couples FiFA with a Siamese training scheme using Deep Canonical Correlation Analysis (DCCA) loss to enforce cross-view consistency between two augmented views, while employing a Transformer+CNN backbone (ViT-B/16 with anti-aliased hourglass modules and an FF-Parser) for robust heatmap-based landmark regression. The method demonstrates state-of-the-art performance on COFW, 300W, and AFLW, supported by extensive ablations showing the contributions of FiFA, DCCA, and the architectural components. Overall, FiFA enhances FLD robustness to pose, illumination, and occlusion, with potential applicability to other face-related tasks.

Abstract

Deep learning methods have led to significant improvements in performance on the facial landmark detection (FLD) task. However, detecting landmarks in challenging settings, such as head pose changes, exaggerated expressions, or uneven illumination, remains difficult due to high variability and insufficient samples. This inadequacy can be attributed to the model's inability to effectively acquire appropriate facial structure information from the input images. To address this, we propose a novel image augmentation technique specifically designed for the FLD task to enhance the model's understanding of facial structures. To effectively utilize the newly proposed augmentation technique, we employ a Siamese architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss to achieve collective learning of high-level feature representations from two different views of the input images. Furthermore, we employ a Transformer + CNN-based network with a custom hourglass module as the robust backbone for the Siamese framework. Extensive experiments show that our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.
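
To make the DCCA-based loss more concrete, the sketch below computes the standard CCA correlation objective (the negative sum of canonical correlations) between two batches of features, one per Siamese branch. The paper's exact formulation is not reproduced in this summary, so this is only an illustration under stated assumptions: the function name `cca_correlation_loss`, the ridge term `reg`, and the use of plain NumPy rather than an autodiff framework are all choices made here, not the authors' implementation.

```python
import numpy as np


def cca_correlation_loss(h1, h2, reg=1e-4):
    """Negative sum of canonical correlations between two feature batches.

    h1, h2: arrays of shape (n_samples, dim), one per view.
    reg: small ridge term added to the covariances (an assumption made
         here for numerical stability).
    """
    n = h1.shape[0]
    # Centre each view so the products below are covariances.
    h1c = h1 - h1.mean(axis=0, keepdims=True)
    h2c = h2 - h2.mean(axis=0, keepdims=True)

    s11 = h1c.T @ h1c / (n - 1) + reg * np.eye(h1.shape[1])
    s22 = h2c.T @ h2c / (n - 1) + reg * np.eye(h2.shape[1])
    s12 = h1c.T @ h2c / (n - 1)

    def inv_sqrt(m):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        vals, vecs = np.linalg.eigh(m)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    # Singular values of T are the canonical correlations.
    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    return -np.linalg.svd(t, compute_uv=False).sum()


rng = np.random.default_rng(0)
view_a = rng.normal(size=(200, 4))
view_b_aligned = 2.0 * view_a + 1.0          # perfectly correlated view
view_b_random = rng.normal(size=(200, 4))    # unrelated view

loss_aligned = cca_correlation_loss(view_a, view_b_aligned)
loss_random = cca_correlation_loss(view_a, view_b_random)
```

Minimising this quantity pushes the two views toward maximally correlated representations, which is the cross-view consistency the Siamese scheme aims for. In an actual training loop the same computation would be written in a differentiable framework so that gradients reach both branches.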

Paper Structure

This paper contains 11 sections, 8 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Illustration of the proposed Fiducial Focus Augmentation (FiFA). In row (a), 5$\times$5 black patches are created around the landmark joints (along with other standard augmentations) in the initial epochs and reduced over the epochs. Rows (b) and (c) show the corresponding GradCAM-based saliency maps of the network's last layer with and without FiFA, respectively. Activations are more prominent around the desired landmarks when FiFA is used as an additional augmentation.
  • Figure 2: An overview of the proposed Siamese-based framework. PPE = Patch + Position Embeddings; RB = Residual Block; MHA = Multi-Head Attention, MLP = Multi-Layer Perceptron; CBP = Convolution+BlurPool; BU = Bilinear Upsampling; FFP = FF-Parser.
  • Figure 3: Qualitative results on the WFLW test set. Landmarks shown in green are produced by our method, while those in red are produced by the state-of-the-art approach FaRL.
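
The augmentation described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the caption only says the 5×5 patches are "reduced over the epochs", so the linear shrink of the patch side in `patch_size_for_epoch` is an assumption, as are both function names and the choice to clip patches at the image border.

```python
import numpy as np


def patch_size_for_epoch(epoch, total_epochs, start_size=5):
    """Patch side length for a given epoch (assumed linear decay to 0)."""
    if total_epochs <= 1:
        return start_size
    frac = 1.0 - epoch / (total_epochs - 1)
    return int(round(start_size * max(frac, 0.0)))


def fiducial_focus_augment(image, landmarks, patch_size):
    """Return a copy of `image` with black squares centred on each landmark.

    image: array of shape (H, W) or (H, W, C).
    landmarks: iterable of (x, y) pixel coordinates.
    patch_size: side length of each square in pixels; <= 0 disables it.
    """
    out = image.copy()
    if patch_size <= 0:
        return out
    h, w = out.shape[:2]
    half = patch_size // 2
    for x, y in landmarks:
        cx, cy = int(round(x)), int(round(y))
        # Clip the square to the image bounds.
        y0, y1 = max(cy - half, 0), min(cy - half + patch_size, h)
        x0, x1 = max(cx - half, 0), min(cx - half + patch_size, w)
        if y0 < y1 and x0 < x1:
            out[y0:y1, x0:x1] = 0
    return out


image = np.full((32, 32, 3), 255, dtype=np.uint8)
landmarks = [(10, 10), (0, 0)]
size = patch_size_for_epoch(epoch=0, total_epochs=10)
augmented = fiducial_focus_augment(image, landmarks, size)
```

The input image is left untouched and the landmark annotations are unchanged, so the same targets can be used for both the augmented and the non-augmented view.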