SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation

Kejia Yin, Varshanth R. Rao, Ruowei Jiang, Xudong Liu, Parham Aarabi, David B. Lindell

TL;DR

SCE-MAE tackles self-supervised facial landmark estimation by combining region-level MAE pretraining with selective correspondence refinement. The Correspondence Approximation and Refinement Block (CARB) uses density peak clustering to approximate inattentive, non-landmark regions with a small set of cluster centers, and employs a Locality-Constrained Repellence (LCR) loss to refine only the salient local correspondences, yielding sharper landmark representations. On landmark matching and detection benchmarks, the approach substantially outperforms state-of-the-art methods, including under limited annotations and with smaller backbones. The authors attribute the gains to MAE features that suit dense prediction and to the targeted refinement. The framework offers robust, high-fidelity landmark localization, with potential for more efficient, sparse correspondence computation in future work.

Abstract

Self-supervised landmark estimation is a challenging task that demands the formation of locally distinct feature representations to identify sparse facial landmarks in the absence of annotated data. To tackle this task, existing state-of-the-art (SOTA) methods (1) extract coarse features from backbones that are trained with instance-level self-supervised learning (SSL) paradigms, which neglect the dense prediction nature of the task, (2) aggregate them into memory-intensive hypercolumn formations, and (3) supervise lightweight projector networks to naively establish full local correspondences among all pairs of spatial features. In this paper, we introduce SCE-MAE, a framework that (1) leverages the MAE, a region-level SSL method that naturally better suits the landmark prediction task, (2) operates on the vanilla feature map instead of on expensive hypercolumns, and (3) employs a Correspondence Approximation and Refinement Block (CARB) that utilizes a simple density peak clustering algorithm and our proposed Locality-Constrained Repellence Loss to directly hone only select local correspondences. We demonstrate through extensive experiments that SCE-MAE is highly effective and robust, outperforming existing SOTA methods by large margins of approximately 20%-44% on the landmark matching and approximately 9%-15% on the landmark detection tasks.
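
To make the "simple density peak clustering algorithm" mentioned above concrete, here is a minimal sketch of the classic density peak idea: score each point by its local density times its distance to the nearest denser point, and take the top-scoring points as centers. The function name `density_peak_clusters`, the cutoff `d_c`, the toy 2-D points, and the nearest-center assignment step are illustrative assumptions; the exact variant used by SCE-MAE is not specified in this summary.

```python
import math

def density_peak_clusters(points, k, d_c):
    """Toy density peak clustering; names and parameters are illustrative assumptions."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho_i: local density, the number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Visit points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    # delta_i: distance to the nearest point that precedes i in density order.
    # The densest point has no predecessor, so it gets its largest distance.
    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
        else:
            delta[i] = min(dist[i][j] for j in order[:rank])
    # Centers are dense points that lie far from any denser point.
    centers = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)[:k]
    # Simplification (assumption): assign every point to its nearest center.
    labels = [min(range(k), key=lambda c: dist[i][centers[c]]) for i in range(n)]
    return centers, labels

# Two well-separated toy blobs standing in for inattentive token features.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
centers, labels = density_peak_clusters(points, k=2, d_c=0.5)
```

In this toy setup each blob contributes one center, and all points in a blob share a label. Real patch tokens would be high-dimensional vectors, but the scoring logic is unchanged.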

Paper Structure (23 sections, 7 equations, 10 figures, 10 tables)

This paper contains 23 sections, 7 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: SCE-MAE vs prior self-supervised facial landmark detection methods. Stage 1: Prior works (top) use instance-level multi-view SSL paradigms that output less distinct initial local features. Our framework (bottom) leverages MAE to naturally form better initial features that result in well-defined boundaries between facial landmarks (see t-SNE plots). Stage 2: Prior works operate on memory-intensive hypercolumns and supervise each feature pair to achieve correspondence. Our framework employs a Correspondence Approximation and Refinement Block (CARB) that operates on the original MAE output and directly hones only the selected correspondence pairs. For the example query, SCE-MAE outputs a more-focused and sharper similarity map, demonstrating the superiority of the final features.
  • Figure 2: An overview of the second stage of our proposed SCE-MAE. We first split the MAE patch tokens into attentive (blue) and inattentive (yellow) tokens based on CLS token similarity. The inattentive tokens are clustered into $K$ cluster centers. In the Correspondence Approximation and Refinement Block (CARB), we first substitute the inattentive tokens using the cluster centers (square symbols) and then refine the local features using our novel Locality-Constrained Repellence (LCR) Loss. The LCR loss weakens existing erroneous correspondences in a weighted manner by considering the token-pair proximity (locality) and correspondence type (repellence) constraints.
  • Figure 3: Comparison between the original (red) and re-annotated (green) landmarks in AFLW$_R$ test set. We denote the original and corrected test sets as AFLW$_{RO}$ and AFLW$_{RC}$ respectively.
  • Figure 4: Qualitative results on landmark matching. The reference/ground-truth are shown in the top/bottom row. The middle rows show the matching results of our method and prior works, grouped column-wise by errors occurring with the eyes, nose and lip corner landmarks respectively. Our method outputs consistently more accurate matching resulting from leveraging higher fidelity projected features.
  • Figure 5: t-SNE plot of the landmark representations. $\dag$ denotes the usage of the stage 1 hypercolumn representations. SC denotes the Silhouette Coefficient, a score (higher is better) which measures the quality of the clustering. Our method results in both a clear separation between the landmarks and the densest landmark clusters, resulting in the highest Silhouette Coefficient.
  • ...and 5 more figures
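
The token handling described in the Figure 2 caption (split patch tokens into attentive and inattentive sets by CLS token similarity, then substitute inattentive tokens with cluster centers) can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the use of cosine similarity, the `keep_ratio` parameter, and the helper names `split_tokens` and `substitute_with_centers` are all assumptions introduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def split_tokens(cls_token, patch_tokens, keep_ratio=0.5):
    """Rank patches by similarity to the CLS token (assumed: cosine similarity).

    The top keep_ratio fraction is treated as attentive, the rest as inattentive.
    """
    sims = [cosine(cls_token, t) for t in patch_tokens]
    n_keep = max(1, int(keep_ratio * len(patch_tokens)))
    ranked = sorted(range(len(patch_tokens)), key=lambda i: sims[i], reverse=True)
    return sorted(ranked[:n_keep]), sorted(ranked[n_keep:])

def substitute_with_centers(patch_tokens, inattentive, labels, centers):
    """Replace each inattentive token with the center of its assigned cluster."""
    out = list(patch_tokens)
    for idx, lab in zip(inattentive, labels):
        out[idx] = centers[lab]
    return out

# Toy 2-D "tokens": the first two align with the CLS token, the last two do not.
cls_token = (1.0, 0.0)
patches = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 1.0)]
attentive, inattentive = split_tokens(cls_token, patches, keep_ratio=0.5)
# Assume a clustering step put both inattentive tokens into cluster 0.
proxied = substitute_with_centers(patches, inattentive, labels=[0, 0], centers=[(0.05, 1.0)])
```

After substitution, pairwise correspondences only need to distinguish the attentive tokens and the $K$ cluster centers, which is the sense in which the caption describes approximating the inattentive tokens before refinement.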