Improving Facial Landmark Detection Accuracy and Efficiency with Knowledge Distillation

Zong-Wei Hong; Yu-Chen Lin

TL;DR

The paper addresses robust facial-landmark detection under embedded-resource constraints by transferring knowledge from a heavy transformer-based teacher (SwinV2) to a lightweight mobile student (MobileViT-v2), using heatmap-based regression and a two-stage training regime. It combines anisotropic attention module (AAM) and STAR-based supervision with a distillation loss that aligns teacher and student heatmaps, $\mathcal{L}_{KD} = \sum_{i=1}^N \sum_k \| [\mathcal{G}_T(\mathcal{F}_T(\cdot))]^i_k - [\mathcal{G}_S(\mathcal{F}_S(\cdot))]^i_k \|_2$, with the aim of real-time inference on embedded devices. The approach achieves competitive accuracy while significantly improving efficiency, takes tflite-runtime compatibility into account, and placed 6th out of 165 participants in the IEEE ICME 2024 PAIR competition. The combination of heatmap-based decoding, transformer-backed distillation, and mobile-friendly architectural adaptations is of practical use for AR, facial analysis, and biometric systems on low-power hardware. The work highlights the importance of model architecture choices and runtime compatibility when deploying advanced facial-landmark solutions in constrained environments.
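To make the distillation loss above concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that $i$ indexes samples and $k$ indexes landmarks, and that each heatmap is a small 2D grid of floats; the names `heatmap_l2` and `kd_loss` are hypothetical. The teacher and student heatmaps are taken as already computed, since the summary does not describe the heatmap generators in detail.

```python
import math
from typing import List

Heatmap = List[List[float]]  # one H x W heatmap for a single landmark


def heatmap_l2(a: Heatmap, b: Heatmap) -> float:
    """L2 norm of the element-wise difference between two same-shaped heatmaps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("heatmaps must have the same shape")
    squared = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return math.sqrt(squared)


def kd_loss(teacher: List[List[Heatmap]], student: List[List[Heatmap]]) -> float:
    """Sum of per-heatmap L2 distances over samples i and landmarks k.

    teacher[i][k] and student[i][k] are the heatmaps for landmark k of sample i
    (this index convention is an assumption, not stated in the source).
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must cover the same samples")
    total = 0.0
    for t_maps, s_maps in zip(teacher, student):
        if len(t_maps) != len(s_maps):
            raise ValueError("teacher and student must cover the same landmarks")
        for t_k, s_k in zip(t_maps, s_maps):
            total += heatmap_l2(t_k, s_k)
    return total


# One sample, two landmarks, 2x2 heatmaps.
teacher_maps = [[[[0.0, 0.0], [0.0, 1.0]], [[3.0, 0.0], [0.0, 4.0]]]]
student_maps = [[[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]]
loss = kd_loss(teacher_maps, student_maps)
```

In a real training loop the same quantity would be computed on tensors with an autodiff framework so that gradients reach the student; the nested-list version only illustrates which terms are summed.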

Abstract

Facial-landmark detection has advanced significantly within computer vision and has become increasingly essential across applications such as augmented reality, facial recognition, and emotion analysis. Unlike object detection or semantic segmentation, which focus on identifying objects and outlining boundaries, facial-landmark detection aims to precisely locate and track critical facial features. However, deploying deep learning-based facial-landmark detection models on embedded systems with limited computational resources is challenging because of the complexity of facial features, especially in dynamic settings. Ensuring robustness across diverse ethnicities and expressions presents further obstacles. Existing datasets often lack comprehensive representation of facial nuances, particularly within populations like those in Taiwan. This paper introduces a knowledge distillation method to address these challenges. By transferring knowledge from larger models to smaller ones, we aim to create lightweight yet powerful deep learning models tailored specifically for facial-landmark detection tasks. Our goal is to design models capable of accurately locating facial landmarks under varying conditions, including diverse expressions, orientations, and lighting environments. The ultimate objective is to achieve high accuracy and real-time performance suitable for deployment on embedded systems. The method achieved a 6th-place finish out of 165 participants in the IEEE ICME 2024 PAIR competition.

Paper Structure (15 sections, 3 equations, 1 figure, 3 tables)

This paper contains 15 sections, 3 equations, 1 figure, 3 tables.

Introduction
Implementation Techniques
Preliminary
Model Architecture
Teacher architecture
Student architecture
Heatmap generator architecture
Knowledge Distillation Loss
Experiment
Experiment setting
Model Complexity & Model Execution Efficiency
Normalized Mean Square Error
Converted Model Complexity and Execution Efficiency
Model Structure
Conclusion

Figures (1)

Figure 1: We propose a two-stage training process for our method. In the first stage, we train the teacher model, denoted as $\mathcal{F}_{T}$, using a combination of two loss functions: $\mathcal{L}_{AAM}$ and $\mathcal{L}_{STAR}$. Subsequently, in the second stage, we train the student model by introducing an additional loss function, $\mathcal{L}_{KD}$.
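The two-stage objective described in the caption can be sketched as follows. This is only an illustration under stated assumptions: the losses are combined by a plain sum, the student keeps the $\mathcal{L}_{AAM}$ and $\mathcal{L}_{STAR}$ terms and adds $\mathcal{L}_{KD}$, and the `kd_weight` parameter and the function name `stage_objective` are hypothetical, since the summary gives no loss weights.

```python
from typing import Optional


def stage_objective(
    l_aam: float,
    l_star: float,
    l_kd: Optional[float] = None,
    kd_weight: float = 1.0,  # hypothetical weight; not given in the source
) -> float:
    """Combine already-computed loss values for one training step.

    Stage 1 (teacher): pass l_kd=None, giving L_AAM + L_STAR.
    Stage 2 (student): pass the distillation term, giving
    L_AAM + L_STAR + kd_weight * L_KD.
    """
    total = l_aam + l_star
    if l_kd is not None:
        total += kd_weight * l_kd
    return total


teacher_step = stage_objective(0.5, 0.25)             # stage 1
student_step = stage_objective(0.5, 0.25, l_kd=1.0)   # stage 2
```

Keeping the distillation term optional lets one function describe both stages; whether the teacher is kept frozen during stage 2 is not stated in this summary.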
