1DFormer: a Transformer Architecture Learning 1D Landmark Representations for Facial Landmark Tracking

Shi Yin, Shijie Huan, Shangfei Wang, Jinshui Hu, Tao Guo, Bing Yin, Baocai Yin, Cong Liu

TL;DR

Facial landmark tracking in the wild demands modeling of both long-range temporal dynamics and facial geometry. The authors introduce 1DFormer, a Transformer-based architecture that learns 1D landmark representations through temporal modeling with a confidence-aware recurrent token mixing mechanism and axis-landmark positional embeddings, coupled with a structural module that encodes intra- and inter-group facial geometry via 1D convolutions. The approach jointly optimizes 1D heatmaps and feature confidences, using pseudo labels for confidences during training and a staged optimization schedule. Experiments on the 300VW and TF datasets demonstrate state-of-the-art accuracy and stability, showing strong robustness to occlusions and appearance variations with efficient computation.

Abstract

Recently, heatmap regression methods based on 1D landmark representations have shown prominent performance in locating facial landmarks. However, previous methods have not deeply explored the potential of 1D landmark representations for sequential and structural modeling of multiple landmarks in facial landmark tracking. To address this limitation, we propose a Transformer architecture, namely 1DFormer, which learns informative 1D landmark representations by capturing the dynamic and geometric patterns of landmarks via token communications in both the temporal and spatial dimensions. For temporal modeling, we propose a recurrent token mixing mechanism, an axis-landmark-positional embedding mechanism, and a confidence-enhanced multi-head attention mechanism to adaptively and robustly embed long-term landmark dynamics into their 1D representations. For structure modeling, we design intra-group and inter-group structure modeling mechanisms that encode component-level and global-level facial structure patterns, refining the 1D representations of landmarks through token communications in the spatial dimension via 1D convolutional layers. Experimental results on the 300VW and TF databases show that 1DFormer successfully models long-range sequential patterns and inherent facial structures to learn informative 1D representations of landmark sequences, and achieves state-of-the-art performance on facial landmark tracking.
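To make the notion of a 1D landmark representation concrete, the following minimal sketch encodes one landmark as a pair of 1D heatmaps (one per axis) and decodes it back by taking the peak of each. It is an illustration only, not the authors' implementation: the Gaussian shape, the `sigma` value, the 64-bin resolution, and the function names `encode_1d_heatmaps` and `decode_1d_heatmaps` are all assumptions introduced here.

```python
import math


def encode_1d_heatmaps(x, y, width, height, sigma=1.5):
    """Encode one landmark as two 1D heatmaps (hypothetical helper).

    Assumes a Gaussian profile centred on the landmark coordinate along
    each axis; the abstract does not specify the heatmap shape.
    """
    hx = [math.exp(-((i - x) ** 2) / (2 * sigma ** 2)) for i in range(width)]
    hy = [math.exp(-((j - y) ** 2) / (2 * sigma ** 2)) for j in range(height)]
    return hx, hy


def decode_1d_heatmaps(hx, hy):
    """Recover integer (x, y) as the index of the peak of each 1D heatmap."""
    x = max(range(len(hx)), key=hx.__getitem__)
    y = max(range(len(hy)), key=hy.__getitem__)
    return x, y


# Round trip for a landmark at (20, 45) on an assumed 64 x 64 grid.
hx, hy = encode_1d_heatmaps(x=20, y=45, width=64, height=64)
print(decode_1d_heatmaps(hx, hy))  # (20, 45)
```

Under these assumptions, each landmark is described by `width + height` values rather than the `width * height` values a 2D heatmap would need, which is one reason per-axis representations are convenient to treat as tokens.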

Paper Structure (18 sections, 8 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: An illustration of 1D heatmap regression methods, which are built upon 1D landmark representations, including 1D feature vectors and heatmaps. Although current methods achieve remarkable performance in detecting landmarks, they have not deeply explored temporal and structural modeling, which is critical for landmark tracking.
  • Figure 2: Upper part: an architectural overview of the proposed facial landmark tracking method, i.e., 1DFormer. Lower part: the internal architecture of a basic block of 1DFormer on the x axis. The basic block on the y axis has the same architecture.
  • Figure 3: Visualization of the tracking results on a challenging video clip from 300VW S3. The red and green points are the tracking results without and with the recurrent token mixing strategy, respectively.
  • Figure 4: Visualization of the attention weights and the tracked results for an occluded landmark, i.e., the left corner of outer-ocular, from a challenging movie clip. The red and green points are the tracking results without and with the help of the confidence branch, respectively.
  • Figure 5: Visualization of the tracking results on two challenging video frames, the former from a movie and the latter from 300VW S3. The red and green points are the results of the tracker without and with the structural modeling mechanisms, respectively.
  • ...and 3 more figures