Precise Facial Landmark Detection by Dynamic Semantic Aggregation Transformer

Jun Wan, He Liu, Yujia Wu, Zhihui Lai, Wenwen Min, Jun Liu

TL;DR

The paper tackles precise facial landmark detection under challenging conditions such as large pose and occlusion. It introduces Dynamic Semantic Aggregation Transformer (DSAT), which combines Dynamic Semantic-Aware (DSA) sample partitioning with Dynamic Semantic Specialization (DSS) cross-scale feature aggregation to learn more discriminative, specialized features. A boundary heatmap auxiliary task and an hourglass backbone support robust multi-scale learning, with DSAT implemented as a dynamic architecture and dynamic parameter framework. Empirical results on AFLW, 300W, WFLW, and COFW show state-of-the-art accuracy and stronger occlusion/pose robustness, underscored by detailed ablations and analyses; code is publicly available.

Abstract

Deep neural network methods currently play a dominant role in the face alignment field. However, they generally use predefined network structures to predict landmarks, which tend to learn general features and lead to mediocre performance; e.g., they perform well on neutral samples but struggle with faces exhibiting large poses or occlusions. Moreover, they cannot effectively deal with semantic gaps and ambiguities among features at different scales, which may hinder them from learning efficient features. To address these issues, we propose a Dynamic Semantic-Aggregation Transformer (DSAT) for learning more discriminative and representative (i.e., specialized) features. Specifically, a Dynamic Semantic-Aware (DSA) model is first proposed to partition samples into subsets and activate specific pathways for them by estimating the semantic correlations of feature channels, making it possible to learn specialized features from each subset. Then, a novel Dynamic Semantic Specialization (DSS) model is designed to mine the homogeneous information from features at different scales, eliminating the semantic gap and ambiguities and enhancing the representation ability. Finally, by integrating the DSA model and DSS model into the proposed DSAT in both dynamic architecture and dynamic parameter manners, more specialized features can be learned to achieve more precise face alignment. Interestingly, we show that harder samples can be handled by activating more feature channels. Extensive experiments on popular face alignment datasets demonstrate that the proposed DSAT outperforms state-of-the-art models in the literature. Our code is available at https://github.com/GERMINO-LiuHe/DSAT.
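
The idea of activating a sample-specific subset of feature channels can be illustrated with a small gating sketch. This is not the authors' DSA implementation: the pooling step, the sigmoid gate, the 0.5 threshold, and every function name below (`global_average_pool`, `channel_gate`, `apply_gate`, `activation_ratio`) are assumptions made for illustration, and the `weights` matrix is a hypothetical stand-in for whatever learned estimate of channel correlations the real model uses.

```python
import math


def global_average_pool(feature_map):
    """Collapse each channel (a 2-D grid of floats) to its mean activation."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(feature_map, weights, bias, threshold=0.5):
    """Return a 0/1 mask with one entry per channel.

    weights (C x C) and bias (length C) are hypothetical learned parameters
    that mix the pooled channel descriptors into one gate score per channel.
    """
    descriptors = global_average_pool(feature_map)
    scores = [
        sigmoid(sum(w * d for w, d in zip(row, descriptors)) + b)
        for row, b in zip(weights, bias)
    ]
    return [1 if s > threshold else 0 for s in scores]


def apply_gate(feature_map, mask):
    """Zero out every channel whose mask entry is 0."""
    return [
        [[value * m for value in row] for row in channel]
        for channel, m in zip(feature_map, mask)
    ]


def activation_ratio(mask):
    """Fraction of channels left active for this sample."""
    return sum(mask) / len(mask)
```

Under these assumptions, `activation_ratio(channel_gate(...))` gives a per-sample quantity comparable in spirit to the activated-channel ratio the abstract refers to: a sample whose gate scores exceed the threshold on more channels keeps more of them active.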

Paper Structure

This paper contains 17 sections, 11 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The ratios of activated channels for different samples. The harder the face image, the greater the ratio of activated channels. This indicates that our proposed DSAT could learn specialized features according to the difficulty estimation of the face images for achieving more precise face alignment.
  • Figure 2: The overall architecture of the proposed Dynamic Semantic Aggregation Transformer (DSAT). DSAT embeds the Dynamic Semantic-Aware model and the Dynamic Semantic Specialization model in both dynamic architecture and dynamic parameter manners to learn more effective specialized features for more precise face alignment.
  • Figure 3: The architecture of the proposed Dynamic Semantic-Aware (DSA) model. The DSA model is able to partition all training samples into subsets and activate specific pathways for each sample subset by estimating the semantic correlations of feature channels, which helps to learn more specialized features for enhancing representation ability.
  • Figure 4: The network architecture of the DSS model. With a well-designed Cross-Channel Attention (CCA) module, the DSS model mines homogeneous information across features at different scales by querying them, thereby bridging the semantic gap between scales and eliminating semantic ambiguity. Moreover, the DSS model can also work in a dynamic feature manner to save computational cost and enhance feature representations.
  • Figure 5: The average number of DSA model channels activated on the WFLW dataset. The DSA model activates a different number of channels for different types of images.
  • ...and 3 more figures
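
The Figure 4 caption describes a Cross-Channel Attention module that queries features at different scales. The listing above gives no formula for it, so the following is only a generic sketch of cross-attention in which each channel is treated as a token. The scaled dot-product form, the assumption that channels from both scales have already been reduced to equal-length descriptor vectors, and the function names `softmax` and `cross_channel_attention` are all illustrative choices, not the paper's CCA definition.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_channel_attention(query_channels, key_channels, value_channels):
    """Aggregate value channels for each query channel.

    Each argument is a list of channel descriptors (equal-length vectors).
    Queries are assumed to come from one scale, keys and values from another.
    Returns one aggregated vector per query channel.
    """
    dim = len(query_channels[0])
    out_dim = len(value_channels[0])
    outputs = []
    for q in query_channels:
        # Similarity of this query channel to every key channel.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in key_channels
        ]
        weights = softmax(scores)
        # Weighted sum of value channels: similar channels contribute more.
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, value_channels))
            for i in range(out_dim)
        ])
    return outputs
```

In this sketch, a query channel from one scale draws mostly on the channels of the other scale that it resembles, which is one plausible reading of "mining homogeneous information" across scales.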