Faceptor: A Generalist Model for Face Perception

Lixiong Qin, Mei Wang, Xuannan Liu, Yuhang Zhang, Wei Deng, Xiaoshuai Song, Weiran Xu, Weihong Deng

TL;DR

Faceptor addresses unified face perception with two architectures from one model family: Naive Faceptor, which pairs a shared backbone with standardized output heads, and Faceptor, which uses a single-encoder dual-decoder design with Layer-Attention to capture task-specific semantics. The approach builds on a transformer-based encoder–decoder framework and a pixel decoder to support dense, attribute, and identity predictions in a multi-task setting, and uses a two-stage training scheme to make Layer-Attention effective. Trained jointly on 13 datasets covering 6 tasks, Faceptor matches or surpasses specialized methods on most tasks, improves data-sparse tasks such as age estimation and expression recognition through auxiliary supervised learning, and needs less storage than Naive Faceptor's per-task output heads as the number of tasks grows. The results suggest that scalable multi-task face perception systems are practical to deploy, and the authors state that code and models will be released publicly.

Abstract

With the comprehensive research conducted on various face analysis tasks, there is a growing interest among researchers to develop a unified approach to face perception. Existing methods mainly discuss unified representation and training, which lack task extensibility and application efficiency. To tackle this issue, we focus on the unified model structure, exploring a face generalist model. As an intuitive design, Naive Faceptor enables tasks with the same output shape and granularity to share the structural design of the standardized output head, achieving improved task extensibility. Furthermore, Faceptor is proposed to adopt a well-designed single-encoder dual-decoder architecture, allowing task-specific queries to represent new-coming semantics. This design enhances the unification of model structure while improving application efficiency in terms of storage overhead. Additionally, we introduce Layer-Attention into Faceptor, enabling the model to adaptively select features from optimal layers to perform the desired tasks. Through joint training on 13 face perception datasets, Faceptor achieves exceptional performance in facial landmark localization, face parsing, age estimation, expression recognition, binary attribute classification, and face recognition, achieving or surpassing specialized methods in most tasks. Our training framework can also be applied to auxiliary supervised learning, significantly improving performance in data-sparse tasks such as age estimation and expression recognition. The code and models will be made publicly available at https://github.com/lxq1000/Faceptor.
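
To make the idea of adaptively selecting features from different layers more concrete, the following is a minimal sketch of one way such a weighting could work. It is not the paper's implementation: the function names (`softmax`, `layer_attention`), the dot-product scoring, and the toy vectors are all assumptions chosen for illustration. Each layer's feature vector is scored against a task-specific query, the scores are normalized with a softmax, and the layer features are blended with those weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_attention(layer_features, task_query):
    """Blend per-layer feature vectors using weights derived from a task query.

    layer_features: list of equal-length vectors, one per encoder layer (hypothetical).
    task_query:     vector of the same length representing one task (hypothetical).
    Returns (weights over layers, fused feature vector).
    """
    dim = len(task_query)
    # Scaled dot-product score between the task query and each layer's feature.
    scores = [
        sum(q * f for q, f in zip(task_query, feat)) / math.sqrt(dim)
        for feat in layer_features
    ]
    weights = softmax(scores)
    # Weighted sum of the layer features, computed per dimension.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, layer_features))
        for i in range(dim)
    ]
    return weights, fused


# Toy example: three "layers" with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, fused = layer_attention(features, query)
```

In this toy example the query aligns with the first dimension, so the first and third layers receive equal weight and the second receives less. A different task query would shift the weights toward other layers, which is the behavior the abstract attributes to Layer-Attention.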

Paper Structure (62 sections, 14 equations, 4 figures, 17 tables)

Figures (4)

  • Figure 1: Existing efforts for unified face perception mainly concentrate on representation and training. Our work focuses on unified model structure, achieving improved task extensibility and increased application efficiency by two designs of face generalist models.
  • Figure 2: Overall architecture of the proposed Faceptor.
  • Figure 3: Two-stage training process that ensures the effectiveness of the Layer-Attention mechanism.
  • Figure 4: Overall architecture of the proposed Naive Faceptor.