Task-adaptive Q-Face

Haomiao Sun, Mingjie He, Shiguang Shan, Hu Han, Xilin Chen

TL;DR

Task-adaptive Q-Face tackles the challenge of learning multiple face analysis tasks with a single model by combining a shared ViT backbone, pre-trained with Masked Image Modeling, with a task-adaptive decoder. The framework uses a Multi-stage Feature Fusion module to combine local and global facial cues, and a query-driven decoder whose learnable label queries use cross-attention to select task-specific features dynamically, thereby reducing inter-task conflict. It achieves state-of-the-art results on expression recognition, action unit detection, attribute recognition, age estimation, and pose estimation across five public datasets, and its cross-attention maps and per-stage feature usage can be visualized for interpretation. The approach promises practical benefits in accuracy and efficiency for unified face analysis, though open-set task extension remains a future direction.

Abstract

Although face analysis has achieved remarkable improvements in the past few years, designing a multi-task face analysis model is still challenging. Most face analysis tasks are studied as separate problems and do not benefit from the synergy among related tasks. In this work, we propose a novel task-adaptive multi-task face analysis method named Q-Face, which simultaneously performs multiple face analysis tasks with a unified model. We fuse the features from multiple layers of a large-scale pre-trained model so that the whole model can use both local and global facial information to support multiple tasks. Furthermore, we design a task-adaptive module that performs cross-attention between a set of query vectors and the fused multi-stage features, adaptively extracting the desired features for each face analysis task. Extensive experiments show that our method can perform multiple tasks simultaneously and achieves state-of-the-art performance on facial expression recognition, action unit detection, face attribute analysis, age estimation, and face pose estimation. Compared to conventional methods, our method opens up new possibilities for multi-task face analysis and shows potential for both accuracy and efficiency.
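
The query-driven step described in the abstract can be illustrated with a minimal NumPy sketch of single-head cross-attention, in which a small set of query vectors attends over the fused feature tokens. This is not the authors' implementation: the dimensions, the random weights, and the names `query_cross_attention`, `w_q`, `w_k`, and `w_v` are assumptions chosen for illustration, and the actual decoder may use multiple heads and layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_cross_attention(queries, tokens, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative, hypothetical helper).

    queries: (Q, d) one vector per label query
    tokens:  (N, d) fused multi-stage feature tokens
    Returns (Q, d) per-query features and (Q, N) attention weights.
    """
    q = queries @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)          # each query's weights sum to 1
    return attn @ v, attn

# Toy sizes, assumed for the example only.
rng = np.random.default_rng(0)
d, num_queries, num_tokens = 16, 5, 12
queries = rng.normal(size=(num_queries, d))
tokens = rng.normal(size=(num_tokens, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

features, attn = query_cross_attention(queries, tokens, w_q, w_k, w_v)
```

Each row of `attn` shows where one query looks among the tokens, which is the kind of quantity an attention-map visualization would display; each row of `features` is the feature vector that query would pass to its prediction head.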

Paper Structure

This paper contains 19 sections, 6 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Compared to previous (a) task-specific methods and (b) multi-task methods, the proposed method can adaptively extract desired features from multi-stage feature maps according to the task requirements, thus exploiting the synergy between tasks.
  • Figure 2: The framework for the proposed task-adaptive Q-Face. We design a task-adaptive decoder to extract task-related features adaptively from different stages and regions. This enables us to perform multiple face analysis tasks more effectively.
  • Figure 3: Diagram of the Multi-stage Feature Fusion (MFF) module. The proposed stage embeddings (SEs) help our method distinguish multi-stage features and enable the query-driven decoder to select task-related features more efficiently.
  • Figure 4: Visualization of the attention maps from the query-driven module. S4, S8, and S12 denote the attention maps over features from different stages. Q-Face can focus on the relevant regions adaptively. For instance, it pays more attention to the eye region for labels such as eyeglasses and bushy eyebrows.
  • Figure 5: Visualization of how different tasks rely on different stages of features. Our model adaptively selects deeper features for the expression recognition task and age estimation task. In contrast, our model prefers to use shallow features for other tasks.
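
The ideas behind Figures 3 and 5 can be sketched as follows: features from several backbone stages are tagged with a per-stage embedding and concatenated, and the attention a query assigns to each stage is then summed to see which stage it relies on. This is a minimal sketch under stated assumptions, not the paper's code: adding the stage embedding to every token and concatenating along the token axis is an assumed fusion scheme, the stage names S4, S8, and S12 are borrowed from the Figure 4 caption, and `fuse_stages` and `stage_usage` are hypothetical helpers.

```python
import numpy as np

def fuse_stages(stage_feats, stage_embeds):
    """Tag each stage's tokens with its embedding, then concatenate.

    stage_feats:  list of S arrays, each (N, d)
    stage_embeds: (S, d), one embedding per stage (assumed additive)
    Returns (S * N, d) tokens in stage-major order.
    """
    tagged = [feats + embed for feats, embed in zip(stage_feats, stage_embeds)]
    return np.concatenate(tagged, axis=0)

def stage_usage(attn, num_stages):
    """Sum attention weights per stage.

    attn: (Q, S * N) with rows summing to 1, tokens in stage-major order.
    Returns (Q, S): the share of each query's attention spent on each stage.
    """
    num_queries, total = attn.shape
    return attn.reshape(num_queries, num_stages, total // num_stages).sum(axis=-1)

# Toy sizes, assumed for the example only.
rng = np.random.default_rng(0)
n_tokens, d = 4, 8
stage_names = ["S4", "S8", "S12"]
stage_feats = [rng.normal(size=(n_tokens, d)) for _ in stage_names]
stage_embeds = rng.normal(size=(len(stage_names), d))
fused = fuse_stages(stage_feats, stage_embeds)

# Two illustrative queries attend over all fused tokens.
queries = rng.normal(size=(2, d))
scores = queries @ fused.T / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)

usage = stage_usage(attn, len(stage_names))
```

With trained weights, a row of `usage` concentrated on the last column would correspond to a task that prefers deeper features, as the Figure 5 caption reports for expression recognition and age estimation; with the random values used here the numbers carry no meaning.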