PAL-Net: A Point-Wise CNN with Patch-Attention for 3D Facial Landmark Localization

Ali Shadman Yazdi, Annalisa Cappella, Benedetta Baldini, Riccardo Solazzo, Gianluca Tartaglia, Chiarella Sforza, Giuseppe Baselli

TL;DR

PAL-Net introduces a lightweight Patch-Attention CNN that localizes 50 anatomical facial landmarks on 3D stereo-photogrammetry meshes by combining atlas-guided patch extraction, local patch learning with 1×1 convolutions, and global attention to preserve inter-landmark geometry. The method achieves state-of-the-art accuracy with a mean point-wise error around 3.69 mm on LAFAS and 0.41 mm on FaceScape, while maintaining low memory usage and fast training. It also demonstrates robust distance preservation (≈2.82 mm) and favorable generalization across datasets, though performance degrades in poorly reconstructed regions like ears and hairline. The work offers a scalable, clinically relevant solution for automated high-throughput 3D anthropometry, with potential to streamline clinical workflows and reduce manual annotation effort, and it provides open-source code for reproducibility.
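The atlas-guided patch extraction step mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the function name `extract_patches`, the use of k-nearest-neighbour selection, and the re-centering of each patch on its approximate landmark are all assumptions made for illustration. The summary only states that patches are taken around approximate landmark positions.

```python
import math

def extract_patches(points, atlas_landmarks, k):
    """For each approximate (atlas) landmark, collect the k nearest mesh points.

    Hypothetical helper: k-nearest selection and re-centering are assumptions,
    not details confirmed by the paper summary.
    """
    patches = []
    for lm in atlas_landmarks:
        # Sort all mesh points by Euclidean distance to the approximate landmark.
        nearest = sorted(points, key=lambda p: math.dist(p, lm))[:k]
        # Express each patch point relative to the approximate landmark.
        patches.append([tuple(c - l for c, l in zip(p, lm)) for p in nearest])
    return patches

# Toy mesh with four vertices and two approximate landmarks.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
atlas = [(0.0, 0.0, 0.0), (5.0, 5.0, 4.0)]
patches = extract_patches(points, atlas, k=2)
```

Each patch is a fixed-size list of local coordinates, which is the kind of input a point-wise (1×1) convolution can process independently per point.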

Abstract

Manual annotation of anatomical landmarks on 3D facial scans is a time-consuming and expertise-dependent task, yet it remains critical for clinical assessments, morphometric analysis, and craniofacial research. While several deep learning methods have been proposed for facial landmark localization, most focus on pseudo-landmarks or require complex input representations, limiting their clinical applicability. This study presents a fully automated deep learning pipeline (PAL-Net) for localizing 50 anatomical landmarks on stereo-photogrammetry facial models. The method combines coarse alignment, region-of-interest filtering, and an initial approximation of landmarks with a patch-based point-wise CNN enhanced by attention mechanisms. Trained and evaluated on 214 annotated scans from healthy adults, PAL-Net achieved a mean localization error of 3.686 mm and preserved relevant anatomical distances with a 2.822 mm average error, comparable to intra-observer variability. To assess generalization, the model was further evaluated on 700 subjects from the FaceScape dataset, achieving a point-wise error of 0.41 mm and a distance-wise error of 0.38 mm. Compared to existing methods, PAL-Net offers a favorable trade-off between accuracy and computational cost. While performance degrades in regions with poor mesh quality (e.g., ears, hairline), the method demonstrates consistent accuracy across most anatomical regions. PAL-Net generalizes effectively across datasets and facial regions, outperforming existing methods in both point-wise and structural evaluations. It provides a lightweight, scalable solution for high-throughput 3D anthropometric analysis, with potential to support clinical workflows and reduce reliance on manual annotation. Source code can be found at https://github.com/Ali5hadman/PAL-Net-A-Point-Wise-CNN-with-Patch-Attention
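The two error measures reported in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: the function names are hypothetical, and the distance-wise error is assumed to be the absolute difference between predicted and ground-truth inter-landmark distances, averaged over all unordered landmark pairs.

```python
import math
from itertools import combinations

def pointwise_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth landmarks."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def distance_error(pred, gt):
    """Mean absolute difference of pairwise inter-landmark distances.

    Assumption: the average runs over unordered pairs (i < j) only, so the
    zero diagonal of the full error matrix is excluded.
    """
    pairs = list(combinations(range(len(gt)), 2))
    total = sum(
        abs(math.dist(pred[i], pred[j]) - math.dist(gt[i], gt[j]))
        for i, j in pairs
    )
    return total / len(pairs)

# Toy example with three landmarks (coordinates in mm).
gt = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
pred = [(0.0, 0.0, 1.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]

pw = pointwise_error(pred, gt)
dw = distance_error(pred, gt)
```

The point-wise measure penalizes any displacement of a landmark, whereas the distance-wise measure only reflects changes in the relative geometry between landmarks, which is why the two values can differ for the same prediction.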

Paper Structure

This paper contains 20 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Preprocessing pipeline for 3D facial data used in the study. The pipeline consists of coarse registration, cropping, and local registration.
  • Figure 2: Architecture overview of PAL-Net for predicting anatomical landmarks on the LAFAS dataset (50 landmarks). On the left, localized facial patches (orange) are extracted around approximated landmark positions (red) based on population-averaged coordinates, with ground truth annotations (green) shown for reference but not used as input. Patches are processed through a series of 2D point-wise CNN blocks with increasing feature depth and max pooling. Attention modules capture global context across all patches. The combined features are passed through fully connected layers to predict the 3D coordinates of 50 landmarks.
  • Figure 3: Example 3D facial meshes with annotated landmarks used in this study.
  • Figure 4: Distance-wise error matrix between ground-truth and predicted landmarks, averaged over the 5-fold cross-validation. Each entry represents the absolute difference in pairwise distances for a specific landmark pair, averaged over the test set. The mean of the matrix is 2.822 mm, indicating the overall distance-wise error.
  • Figure 5: Bland–Altman plots showing prediction errors by facial region (midline, right, left). Each plot displays the difference between predicted and ground-truth coordinates versus their mean value. The dashed lines represent the mean difference (gray) and the 95% limits of agreement (red and blue). The results indicate no systematic bias across regions, with most predictions falling within acceptable error bounds.
  • ...and 2 more figures
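
The quantities drawn in a Bland-Altman plot such as Figure 5 can be computed with a short sketch. The function name `bland_altman` and the sample values are hypothetical; the 95% limits of agreement follow the standard definition of mean difference ± 1.96 sample standard deviations, which is assumed to match the figure.

```python
import statistics

def bland_altman(pred, gt):
    """Return per-sample means and differences, the bias, and the 95% limits."""
    diffs = [p - g for p, g in zip(pred, gt)]        # y-axis: predicted minus ground truth
    means = [(p + g) / 2 for p, g in zip(pred, gt)]  # x-axis: mean of the two values
    bias = statistics.mean(diffs)                    # mean difference (systematic bias)
    sd = statistics.stdev(diffs)                     # sample standard deviation of differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)    # 95% limits of agreement
    return means, diffs, bias, limits

# Illustrative coordinate values in mm (not taken from the paper).
gt_coords = [10.0, 20.0, 30.0, 40.0]
pred_coords = [10.5, 19.0, 31.0, 39.5]

means, diffs, bias, (lower, upper) = bland_altman(pred_coords, gt_coords)
```

A bias close to zero with differences scattered evenly inside the limits corresponds to the "no systematic bias" reading given in the caption.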