
Trajectory-guided Motion Perception for Facial Expression Quality Assessment in Neurological Disorders

Shuchao Duan, Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi

TL;DR

This paper tackles automated facial expression quality assessment (FEQA) for neurological disorders by proposing the Trajectory-guided Motion Perception Transformer (TraMP-Former), a dual-stream architecture that fuses 2D facial landmark trajectories (processed by SkateFormer) with RGB frame semantics (via Former-DFER). The cross-modal TraMP fusion blocks update only the RGB stream while using trajectory-derived motion as keys/values, enabling fine-grained motion capture. Evaluations on PFED5 and an augmented Toronto NeuroFace dataset show state-of-the-art performance, with average Spearman correlations of $\rho$ = 71.86% and 53.84%, respectively, illustrating the benefit of incorporating landmark trajectories for nuanced expression quality assessment. Ablation studies confirm the importance of trajectory representation, fusion strategy, and temporal-length choices, supporting the method’s robustness and potential clinical impact. The work suggests extending to 3D landmarks to further mitigate head-rotation noise and enhance trajectory-based motion perception in FEQA.
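The results above are reported as Spearman rank correlations between predicted and ground-truth scores, written as percentages (so 71.86% corresponds to $\rho$ = 0.7186). The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function names `average_ranks` and `spearman_rho` and the toy scores are illustrative assumptions.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(pred, target):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    rp, rt = average_ranks(pred), average_ranks(target)
    mp, mt = mean(rp), mean(rt)
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / (var_p * var_t) ** 0.5


# Toy example (made-up numbers): predictions swap the order of the last two clips.
predicted = [0.2, 1.1, 1.9, 3.2, 3.0]
ground_truth = [0, 1, 2, 3, 4]
rho = spearman_rho(predicted, ground_truth)
print(f"rho = {rho:.2f} ({100 * rho:.0f}%)")
```

Because only ranks matter, a model is rewarded for ordering clips correctly by severity even if its raw scores are offset or rescaled.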

Abstract

Automated facial expression quality assessment (FEQA) in neurological disorders is critical for enhancing diagnostic accuracy and improving patient care, yet effectively capturing the subtle motions and nuances of facial muscle movements remains a challenge. We propose to analyse facial landmark trajectories, a compact yet informative representation that encodes these subtle motions from a high-level structural perspective. Hence, we introduce the Trajectory-guided Motion Perception Transformer (TraMP-Former), a novel FEQA framework that fuses landmark trajectory features for fine-grained motion capture with visual semantic cues from RGB frames, ultimately regressing the combined features into a quality score. Extensive experiments demonstrate that TraMP-Former achieves new state-of-the-art performance on benchmark datasets for neurological disorders, including PFED5 (up by 6.51%) and an augmented Toronto NeuroFace (up by 7.62%). Our ablation studies further validate the efficiency and effectiveness of landmark trajectories in FEQA. Our code is available at https://github.com/shuchaoduan/TraMP-Former.

Paper Structure

This paper contains 10 sections, 11 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Landmark trajectories during a clench-teeth action - In (a) and (b), the left panel is for a healthy subject and the right for a PD patient. The healthy case presents apparent variations in the mouth region with subtle motions in the eyebrows and nose areas, whereas the severe PD case exhibits significantly diminished movements in the mouth and little to no motion in the nose or eyebrows. In (c), RGB value variations are shown for 4 example landmarks across time for the healthy subject. Overall in this work, a trajectory comprises position and pixel value for each point.
  • Figure 2: Proposed TraMP-Former pipeline. Input trajectories (2D landmark positions with associated RGB values over time) and RGB frames are encoded via the Trajectory Encoder and RGB Encoder to produce trajectory features ($E_{p}$) and RGB features ($E_{f}$), respectively. $E_{f}$ are average-pooled and downsampled to align with $E_{p}$. These are fused in the TraMP fusion module, where $E_{p}$ serves as key and value, and $E_{f}$ as the query, to generate the final 1D representation $E_{O}$, which is passed through an MLP to predict the clinical score $\hat{y}$.
  • Figure 3: Landmarks and their grouping - (top) The $P$ landmarks are split into $M=7$ groups of $N=9$. (bottom) Example landmark-temporal partitioning: The pink and brown dotted trajectory segments in the red box represent the spatio-temporal relationship between two trajectories from different groups within a local temporal segment, while the brown and black dotted trajectory segments in the light green box indicate the spatio-temporal relationship between two trajectories from the same group (i.e., the left eye) within a local temporal segment. The red cross-lines depict the correlation between two landmarks from different groups within global motion, and light green cross-lines represent the correlation between two landmarks from the same group (i.e., upper lip) within global motion.
  • Figure 4: Score distribution for each action in the augmented Toronto NeuroFace dataset. The x-axis represents the average overall score from two raters and the y-axis indicates the frequency of each score.
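The captions for Figures 2 and 3 describe two mechanics that a short sketch can make concrete: splitting the landmarks into $M=7$ groups of $N=9$, and a cross-attention fusion in which RGB features act as queries and trajectory features as keys and values. The NumPy code below is a rough illustration under stated assumptions, not the TraMP-Former implementation: it assumes $P = M \times N = 63$ landmarks stored so that each group is contiguous, five channels per landmark (x, y, R, G, B), a single attention head, a residual update of the RGB stream, mean pooling to obtain the 1D representation, and a linear head standing in for the MLP. All names, sizes, and random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Landmark grouping (cf. Figure 3) ---
# Assumed sizes: T frames, M groups of N landmarks, C = 5 channels (x, y, R, G, B).
T, M, N, C = 16, 7, 9, 5
P = M * N  # 63 landmarks in total
trajectories = rng.normal(size=(T, P, C))
# Assumes landmarks are ordered so that each group occupies N consecutive indices.
grouped = trajectories.reshape(T, M, N, C)


# --- Cross-attention fusion (cf. Figure 2) ---
def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_update(e_f, e_p, w_q, w_k, w_v):
    """Update RGB tokens e_f using trajectory tokens e_p as keys and values.

    e_f: (L, D) RGB features, used as queries.
    e_p: (L, D) trajectory features, used as keys and values.
    Returns the updated RGB tokens and the (L, L) attention map.
    """
    q = e_f @ w_q
    k = e_p @ w_k
    v = e_p @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # Residual connection (an assumption): only the RGB stream is modified.
    return e_f + attn @ v, attn


L, D = 4, 8  # hypothetical token count and feature width after alignment
e_f = rng.normal(size=(L, D))  # stand-in for pooled, downsampled RGB features
e_p = rng.normal(size=(L, D))  # stand-in for trajectory features
w_q, w_k, w_v = (0.1 * rng.normal(size=(D, D)) for _ in range(3))

fused, attn = cross_attention_update(e_f, e_p, w_q, w_k, w_v)
e_o = fused.mean(axis=0)        # pool tokens into one 1D representation
w_head = rng.normal(size=D)     # linear head standing in for the MLP
y_hat = float(e_o @ w_head)     # predicted score (meaningless with random weights)
print(grouped.shape, fused.shape, e_o.shape)
```

Using the trajectory features only as keys and values means the motion cues steer how the RGB tokens are re-weighted while the trajectory stream itself is left unchanged, which matches the one-directional update described in the TL;DR.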