ORFormer: Occlusion-Robust Transformer for Accurate Facial Landmark Detection

Jui-Che Chiang, Hou-Ning Hu, Bo-Syuan Hou, Chia-Yu Tseng, Yu-Lun Liu, Min-Hung Chen, Yen-Yu Lin

TL;DR

ORFormer adds messenger tokens to a transformer to detect occlusions and recover missing features within a single image, enabling robust heatmap generation for facial landmark detection. The method has two stages: a quantized heatmap generator is pre-trained to learn a codebook and decoder, and ORFormer then uses regular and messenger tokens to produce an occlusion map $\alpha$ and a recovered feature $Z_\text{rec}$ by fusing $Z_I$ and $Z_M$, guided by $\alpha$. The recovered heatmaps are integrated with existing FLD methods, yielding competitive results on WFLW, COFW, and 300W, with notable gains under occlusion and extreme poses. The approach makes FLD more robust in practice and suggests that occlusion-aware transformers may apply to feature recovery in other vision tasks.

Abstract

Although facial landmark detection (FLD) has made significant progress, existing FLD methods still suffer from performance drops on partially non-visible faces, such as faces with occlusions or under extreme lighting conditions or poses. To address this issue, we introduce ORFormer, a novel transformer-based method that can detect non-visible regions and recover their missing features from visible parts. Specifically, ORFormer associates each image patch token with one additional learnable token called the messenger token. The messenger token aggregates features from all patches except its own. This way, the consensus between a patch and other patches can be assessed by referring to the similarity between its regular and messenger embeddings, enabling non-visible region identification. Our method then recovers occluded patches with features aggregated by the messenger tokens. Leveraging the recovered features, ORFormer compiles high-quality heatmaps for the downstream FLD task. Extensive experiments show that our method generates heatmaps resilient to partial occlusions. By integrating the resultant heatmaps into existing FLD methods, our method performs favorably against the state of the art on challenging datasets such as WFLW and COFW.
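
The messenger-token idea in the abstract can be illustrated with a small attention sketch. The code below is not the authors' implementation: it assumes a single attention head, uses the raw tokens as queries, keys, and values (no learned projections), and the function names are hypothetical. It only shows the masking rule, namely that messenger token $i$ attends to every patch token except patch $i$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def messenger_attention(patch_tokens, messenger_tokens):
    """Aggregate patch features for each messenger token.

    Messenger i uses its own vector as the query and attends only to
    patch tokens j != i, so its output never sees its own patch.
    Returns (outputs, weights); weights[i][i] is always 0.0.
    """
    n = len(patch_tokens)
    dim = len(patch_tokens[0])
    outputs, all_weights = [], []
    for i, query in enumerate(messenger_tokens):
        others = [j for j in range(n) if j != i]  # mask out own patch
        scores = [
            sum(q * k for q, k in zip(query, patch_tokens[j])) / math.sqrt(dim)
            for j in others
        ]
        row = [0.0] * n
        for j, w in zip(others, softmax(scores)):
            row[j] = w
        # Weighted sum of the visible (non-masked) patch tokens.
        out = [sum(row[j] * patch_tokens[j][d] for j in range(n)) for d in range(dim)]
        outputs.append(out)
        all_weights.append(row)
    return outputs, all_weights
```

Because patch $i$ is excluded from messenger $i$'s attention, corrupting that patch (for example by occlusion) leaves the messenger embedding for position $i$ unchanged in this sketch, which is what makes the comparison with the regular embedding informative.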

Paper Structure

This paper contains 47 sections, 13 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Overview of our ORFormer. (a) For each patch $P_i$, we introduce a patch token $X_i$ and a learnable messenger token $M_i$ for occlusion detection and handling. (b) The messenger token computes attention with patch tokens other than its corresponding one. (c) We detect occlusion by evaluating the dissimilarity between the regular embedding $X_i'$ and the messenger embedding $M_i'$. If occlusion is present in patch $P_i$, we recover the occluded features from the messenger embedding, which is aggregated from the other image patches.
  • Figure 2: Overview of our method. (a) We first train a quantized heatmap generator, which takes an image $I$ as input and generates its edge heatmaps $H$. After pre-training, the prior knowledge of unoccluded faces is encoded in the codebook $C$ and decoder $D$. (b) With the frozen codebook and decoder, we introduce ORFormer to generate the occlusion map $\alpha$ and two code sequences $S_I$ and $S_M$, leading to quantized features $Z_I$ and $Z_M$. The recovered feature $Z_\text{rec}$ is yielded by merging $Z_I$ and $Z_M$ with patch-specific weights given in $\alpha$, and is used to produce occlusion-robust heatmaps $H_\text{rec}$.
  • Figure 3: Network architecture of ORFormer. ORFormer takes image patches $P$ as input and generates two code sequences $S_I$ and $S_M$ via the codebook prediction head. $S_I$ is computed from the image patch tokens, while $S_M$ is computed from the messenger tokens. The occlusion map $\alpha$ represents the patch-specific occlusion likelihood and is inferred by the occlusion detection head.
  • Figure 3: Quantitative comparison for heatmap generation on WFLW. Heatmap regression L2 loss is reported for all subsets. The relative performance gain, given in parentheses, is calculated from the baseline VQVAE [van2017neural]. Text in bold indicates that a method obtains a larger relative gain on that subset than on the full set.
  • Figure 4: Integration of ORFormer into an existing FLD method. ORFormer is adopted for occlusion detection and feature recovery, resulting in high-quality heatmaps. The generated heatmaps serve as an extra input to an FLD method, and offer the recovered features to make the FLD method robust to occlusions.
  • ...and 7 more figures
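
The captions for Figures 1(c) and 2(b) describe scoring each patch by how much its regular and messenger embeddings disagree, then merging $Z_I$ and $Z_M$ with the patch-specific weights in $\alpha$. A minimal sketch in plain Python follows. It rests on two assumptions that the captions do not state: the occlusion score is taken here as a rescaled cosine dissimilarity (the paper infers $\alpha$ with a learned occlusion detection head), and the merge is read as the convex combination $Z_\text{rec} = (1 - \alpha) Z_I + \alpha Z_M$, with $\alpha$ interpreted as occlusion likelihood. All function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def occlusion_map(regular_emb, messenger_emb):
    """Per-patch occlusion score in [0, 1].

    Stand-in for the learned occlusion detection head (assumption):
    0 means the two embeddings agree, 1 means they point in opposite
    directions.
    """
    return [
        (1.0 - cosine_similarity(x, m)) / 2.0
        for x, m in zip(regular_emb, messenger_emb)
    ]


def fuse_features(z_i, z_m, alpha):
    """Blend per-patch features: (1 - alpha) * Z_I + alpha * Z_M (assumed form)."""
    return [
        [(1.0 - a) * fi + a * fm for fi, fm in zip(row_i, row_m)]
        for row_i, row_m, a in zip(z_i, z_m, alpha)
    ]
```

Under these assumptions, a patch whose two embeddings agree keeps its own feature from $Z_I$, while a patch flagged as occluded is replaced mostly by the messenger-derived feature from $Z_M$.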