Infinite 3D Landmarks: Improving Continuous 2D Facial Landmark Detection

Prashanth Chandran, Gaspard Zoss, Paulo Gotardo, Derek Bradley

TL;DR

This work tackles practical drawbacks of state-of-the-art facial landmark detectors by integrating three architectural enhancements: (1) a spatial transformer that learns built-in face localization and normalization, removing the need for a separate pre-processing detector; (2) a 3D landmark prediction head that outputs landmarks in a canonical 3D space along with head pose and camera intrinsics for stable 2D projection and 3D reasoning; and (3) a semantic correction (query deformation) module that harmonizes annotations across multiple datasets. Together, these extensions enable an infinite set of 2D landmarks, improve temporal stability, and provide rich 3D information useful for tasks like visibility estimation, reconstruction, and texture completion, all demonstrated on a modern continuous 2D landmark detector baseline. Quantitatively, the approach yields meaningful gains in normalized mean error and temporal stability on standard benchmarks, while qualitative results show robust performance across in-the-wild, studio, and helmet-mounted camera scenarios. The work offers a practical pathway to more accurate and versatile facial landmark systems with direct benefits for downstream 3D face analysis tasks.

Abstract

In this paper, we examine three important issues in the practical use of state-of-the-art facial landmark detectors and show how a combination of specific architectural modifications can directly improve their accuracy and temporal stability. First, many facial landmark detectors require face normalization as a preprocessing step, which is accomplished by a separately trained neural network that crops and resizes the face in the input image. There is no guarantee that this pre-trained network performs the optimal face normalization for landmark detection. We instead analyze the use of a spatial transformer network that is trained alongside the landmark detector in an unsupervised manner, and jointly learn optimal face normalization and landmark detection. Second, we show that modifying the output head of the landmark predictor to infer landmarks in a canonical 3D space can further improve accuracy. To convert the predicted 3D landmarks into screen space, we additionally predict the camera intrinsics and head pose from the input image. As a side benefit, this allows us to predict the 3D face shape from a given image using only 2D landmarks as supervision, which is useful for determining landmark visibility, among other things. Finally, when training a landmark detector on multiple datasets at the same time, annotation inconsistencies across datasets force the network to produce a suboptimal average. We propose to add a semantic correction network to address this issue. This additional lightweight neural network is trained alongside the landmark detector, without requiring any additional supervision. While the insights of this paper can be applied to most common landmark detectors, we specifically target a recently proposed continuous 2D landmark detector to demonstrate how each of our additions leads to meaningful improvements over the state of the art on standard benchmarks.
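
The conversion of canonical 3D landmarks into screen space can be illustrated with a minimal sketch. It assumes a plain pinhole camera with a single focal length and a rigid head pose (rotation plus translation); the abstract does not state the exact camera parameterization, so the function names, parameters, and numeric values below are hypothetical and chosen only for illustration.

```python
import math

def rotation_y(angle_rad):
    """Rotation matrix for a yaw of angle_rad about the y-axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def project_landmarks(points_3d, rotation, translation, focal, principal_point):
    """Project canonical 3D landmarks to 2D pixels with a pinhole camera.

    Each point X is moved into camera space as R @ X + t (the head pose)
    and then divided by its depth (the intrinsics). This is an assumed
    model, not necessarily the one used in the paper.
    """
    cx, cy = principal_point
    projected = []
    for point in points_3d:
        # Apply the head pose: rotate, then translate.
        cam = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
               for r in range(3)]
        if cam[2] <= 0:
            raise ValueError("landmark lies behind the camera")
        # Perspective division followed by the shift to the principal point.
        projected.append((focal * cam[0] / cam[2] + cx,
                          focal * cam[1] / cam[2] + cy))
    return projected

# Three hypothetical canonical landmarks, viewed frontally from one unit away.
canonical = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
frontal = project_landmarks(canonical, rotation_y(0.0), (0.0, 0.0, 1.0),
                            500.0, (128.0, 128.0))
```

Because only the projected 2D positions are compared with annotations in such a setup, the 3D points themselves need no direct 3D supervision, which matches the abstract's remark that 3D face shape is obtained using only 2D landmarks as supervision.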

Paper Structure (17 sections, 6 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Our inputs include a face image $\mathcal{I}$ (un-normalized) and positions $p_k$ on a canonical shape $\mathcal{C}$. A spatial transformer $\mathcal{S}$ auto-normalizes the face for the feature extractor $\mathcal{F}$, which predicts image features $f_i$ and camera plus head pose parameters $\gamma_i$. Query points $p_k$ are passed through our new query deformer $\mathcal{D}$ to account for different datasets $D_j$, and are then position-encoded by $\mathcal{Q}$. A 3D landmark predictor $\mathcal{P}$ estimates the landmarks in a canonical 3D space, which are projected to the camera plane and transformed back to original image space.
  • Figure 2: Our method can predict accurate facial landmarks in a number of practical scenarios including studio setups, in-the-wild videos, mobile phone recordings, and even helmet-mounted cameras. We show the result of each stage of our pipeline with the input image $\mathcal{I}$ (first column), the RoI detected by the spatial transformer (second column), the resampled or normalized face image $\mathcal{I}'$ (third column), the intermediate 3D landmarks predicted by the model $\bar{L}_{k}^{3d}$ (fourth column), the resulting 2D landmarks ${l}_{k}'$ corresponding to $\mathcal{I}'$ (fifth column), and the final landmark positions $l_{k}$ (last column).
  • Figure 3: While our method remains competitive with the state-of-the-art baseline in common scenarios (first two rows), it provides significantly better results in challenging scenarios like helmet-mounted cameras, where our method is able to capture the overall head shape and expression better than the baseline (last row). Queries $p_{k}$ corresponding to each landmark layout are visualized at the top.
  • Figure 4: We visualize the bounding box trajectories on test videos. In the first column, we show predictions from the widely used face detection algorithm of Zhang et al. While predicting a tighter crop of the face, the method of Zhang et al. results in a noisy trajectory for the bounding box even with very little movement of the face. The learned RoIs predicted by both the similarity (second column) and affine (third column) spatial transformers, while larger in frame, are temporally smoother.
  • Figure 5: We visualize the normalized image $\mathcal{I}'$ in the first row and an overlay of a mesh created using $\bar{L}_{k}^{3d}$ on the image in the second row. The tight overlay of the mesh on the image demonstrates the strong performance of the unsupervised pose estimation from $\mathcal{F}$ and of the 3D landmark predictor $\mathcal{P}$.
  • ...and 7 more figures
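
The last pipeline step named in the captions for Figures 1 and 2, taking the 2D landmarks $l_k'$ found on the normalized image $\mathcal{I}'$ back to positions $l_k$ on the original image $\mathcal{I}$, can be sketched as the inverse of a 2D affine map. The convention assumed here (the spatial transformer maps original coordinates to normalized ones as $x' = Mx + b$) and all names and values are hypothetical; the captions do not specify the parameterization.

```python
def invert_affine(matrix, offset):
    """Invert x' = M x + b for a 2x2 matrix M and a 2-vector b."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("transform is not invertible")
    inv = [[d / det, -b / det],
           [-c / det, a / det]]
    # The inverse map is x = M^-1 x' - M^-1 b.
    inv_offset = [-(inv[0][0] * offset[0] + inv[0][1] * offset[1]),
                  -(inv[1][0] * offset[0] + inv[1][1] * offset[1])]
    return inv, inv_offset

def apply_affine(matrix, offset, points):
    """Apply x -> M x + b to a list of (x, y) points."""
    return [(matrix[0][0] * x + matrix[0][1] * y + offset[0],
             matrix[1][0] * x + matrix[1][1] * y + offset[1])
            for x, y in points]

def to_original_space(landmarks_normalized, matrix, offset):
    """Map landmarks on the normalized image back to the original image."""
    inv, inv_offset = invert_affine(matrix, offset)
    return apply_affine(inv, inv_offset, landmarks_normalized)

# Hypothetical normalization: shrink by half, then shift by (-10, -20).
M = [[0.5, 0.0], [0.0, 0.5]]
b = (-10.0, -20.0)
normalized = apply_affine(M, b, [(100.0, 200.0)])
restored = to_original_space(normalized, M, b)
```

A similarity transformer restricts $M$ to a uniform scale and rotation, while an affine one allows any invertible $M$; the same inversion covers both cases compared in Figure 4.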