FaceXFormer: A Unified Transformer for Facial Analysis

Kartik Narayan, Vibashan VS, Rama Chellappa, Vishal M. Patel

TL;DR

FaceXFormer addresses the fragmentation of facial analysis with a single unified transformer that handles ten tasks in real time. It fuses multi-scale encoder features and passes them to FaceX, a lightweight decoder that uses learnable task tokens and bi-directional cross-attention to produce task-specific predictions jointly. The model achieves state-of-the-art or competitive results on face parsing, landmark detection, head pose estimation, attribute prediction, age/gender/race estimation, facial expression recognition, face recognition, and face visibility, while running at 33.21 FPS at low computational cost. The work shows that a unified, efficient architecture is practical for multi-task facial analysis and offers a basis for future foundation-model-style face systems that can annotate diverse datasets rapidly.
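To put the 33.21 FPS figure in per-frame terms, the sketch below converts a throughput into a latency and shows one common way such a throughput could be measured. It is only an illustration: `fps_to_latency_ms`, `measure_fps`, and `dummy_forward` are hypothetical names, the workload is a stand-in rather than the model, and the paper's actual benchmarking protocol (hardware, batch size, warm-up) is not given in this summary.

```python
import time


def fps_to_latency_ms(fps):
    """Convert frames per second into average milliseconds per frame."""
    return 1000.0 / fps


def measure_fps(fn, n_warmup=5, n_runs=50):
    """Time repeated calls to fn and return calls per second.

    Warm-up calls are excluded so one-time setup costs do not skew the result.
    """
    for _ in range(n_warmup):
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed


def dummy_forward():
    # Stand-in workload (assumption): replace with a real model forward pass.
    return sum(i * i for i in range(10_000))


latency = fps_to_latency_ms(33.21)
fps = measure_fps(dummy_forward)
print(f"{latency:.2f} ms per frame")  # prints "30.11 ms per frame"
```

At 33.21 FPS the model has roughly 30 ms to process each frame, which is within the budget of a typical 30 FPS video stream.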

Abstract

In this work, we introduce FaceXFormer, an end-to-end unified transformer model capable of performing ten facial analysis tasks within a single framework. These tasks include face parsing, landmark detection, head pose estimation, attribute prediction, age, gender, and race estimation, facial expression recognition, face recognition, and face visibility. Traditional face analysis approaches rely on task-specific architectures and pre-processing techniques, limiting scalability and integration. In contrast, FaceXFormer employs a transformer-based encoder-decoder architecture, where each task is represented as a learnable token, enabling seamless multi-task processing within a unified model. To enhance efficiency, we introduce FaceX, a lightweight decoder with a novel bi-directional cross-attention mechanism, which jointly processes face and task tokens to learn robust and generalized facial representations. We train FaceXFormer on ten diverse face perception datasets and evaluate it against both specialized and multi-task models across multiple benchmarks, demonstrating state-of-the-art or competitive performance. Additionally, we analyze the impact of various components of FaceXFormer on performance, assess real-world robustness in "in-the-wild" settings, and conduct a computational performance evaluation. To the best of our knowledge, FaceXFormer is the first model capable of handling ten facial analysis tasks while maintaining real-time performance at 33.21 FPS. Code: https://github.com/Kartik-3004/facexformer
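The idea of task tokens and face tokens attending to each other can be illustrated with a minimal sketch. This is not the authors' FaceX implementation: it assumes single-head scaled dot-product attention with identity projections (no learned weights, normalization, or MLP), and the order of the two attention steps is also an assumption. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention with identity projections.

    Each query token is replaced by a weighted average of the key/value tokens.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out


def add(a, b):
    """Element-wise sum of two token lists (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def bidirectional_block(task_tokens, face_tokens):
    """One toy bi-directional cross-attention step (assumed ordering)."""
    # Task-to-face: each task token gathers evidence from the face tokens.
    task_tokens = add(task_tokens, cross_attention(task_tokens, face_tokens))
    # Face-to-task: face tokens are refined using the updated task tokens.
    face_tokens = add(face_tokens, cross_attention(face_tokens, task_tokens))
    return task_tokens, face_tokens


# Toy inputs: 2 task tokens and 3 face tokens, each of dimension 4.
task_tokens = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.3, 0.0, 0.1]]
face_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
new_task, new_face = bidirectional_block(task_tokens, face_tokens)
```

Both token sets keep their shapes, so such a block can be stacked; in a real decoder the refined task tokens would then feed the task-specific prediction heads.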

Paper Structure (31 sections, 5 equations, 4 figures, 11 tables)

Figures (4)

  • Figure 1: FaceXFormer is an end-to-end unified transformer model for 10 facial analysis tasks: face parsing, landmark detection, head pose estimation, attribute recognition, age, gender, and race estimation, facial expression recognition, face recognition, and face visibility prediction.
  • Figure 2: Overview of our proposed framework. FaceXFormer employs an encoder-decoder architecture that extracts multi-scale features from the input face image $\mathbf{I}$ and fuses them into a unified representation $\mathbf{F}$ via MLP-Fusion. Task tokens $\mathbf{T}$ are processed alongside the face representation $\mathbf{F}$ in the FaceX Decoder $\mathbf{FXDec}$, resulting in refined task-specific tokens $\mathbf{\hat{T}}$. These refined tokens are then passed through the unified head to produce task-specific predictions. FaceXFormer performs ten tasks (face parsing, landmark detection, head pose estimation, attribute prediction, age, gender, and race estimation, facial expression recognition, face recognition, and face visibility prediction), achieving state-of-the-art performance at a real-time $33.21$ FPS.
  • Figure 3: FaceXFormer predictions on "in-the-wild" images
  • Figure E.1: Visualization of "in-the-wild" images for multiple tasks. Attributes represent the $40$ binary attributes defined in the CelebA dataset (Liu et al., 2015), indicating the presence ($1$) or absence ($0$) of specific facial attributes.