Efficient Facial Landmark Detection for Embedded Systems

Ji-Jia Wu

TL;DR

This work targets robust, low-power facial landmark detection on edge devices by introducing the Efficient Facial Landmark Detection (EFLD) architecture. It combines an efficient backbone built from Efficient-OSA (EOSA) modules, a flexible multi-head detection system for different landmark formats, and a cross-format training strategy that leverages heterogeneous public datasets without increasing inference cost. Key contributions include the EOSA-based lightweight backbone, a modular detection-head design supporting 51-, 68-, and 98-point formats, and a data-augmentation strategy that preserves efficiency while enhancing generalization. Empirically, EFLD achieves top performance and energy efficiency in the IEEE ICME 2024 Grand Challenges PAIR Competition, demonstrating strong potential for real-world embedded deployments with int8 quantization and a lightweight TFLite pipeline.

Abstract

This paper introduces the Efficient Facial Landmark Detection (EFLD) model, designed for edge devices that face constraints on power consumption and latency. EFLD features a lightweight backbone and a flexible detection head, both of which improve operational efficiency on resource-constrained devices. To improve the model's robustness, we propose a cross-format training strategy that leverages a wide variety of publicly accessible datasets to enhance generalizability without increasing inference costs. Our ablation study shows how each component contributes to reducing computational demands and model size while improving accuracy. EFLD outperforms competitors in the IEEE ICME 2024 Grand Challenges PAIR Competition, a contest focused on low-power, efficient, and accurate facial landmark detection for embedded systems, showcasing its effectiveness in real-world facial landmark detection tasks.
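The cross-format idea can be illustrated with a minimal sketch. It is not the authors' implementation: the linear heads, the averaging `backbone`, `FEATURE_DIM`, and the mean-squared-error loss are all hypothetical stand-ins chosen for illustration. The only point it shows is the routing: every sample passes through one shared backbone, supervises only the head that matches its own annotation format (51, 68, or 98 points), and the unused heads can be dropped at export so inference cost does not grow.

```python
import random

FORMATS = (51, 68, 98)  # landmark formats named in the summary
FEATURE_DIM = 8         # hypothetical feature size, chosen for illustration


def make_head(num_points, rng):
    """One linear head: FEATURE_DIM inputs -> 2 * num_points outputs (x, y per point)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(FEATURE_DIM)]
            for _ in range(2 * num_points)]


def backbone(image):
    """Placeholder for the shared backbone: averages pixels into FEATURE_DIM buckets."""
    chunk = len(image) // FEATURE_DIM
    return [sum(image[i * chunk:(i + 1) * chunk]) / chunk
            for i in range(FEATURE_DIM)]


def apply_head(head, feature):
    """Apply a linear head to a feature vector."""
    return [sum(w * f for w, f in zip(row, feature)) for row in head]


def cross_format_loss(heads, batch):
    """Mean squared error where each sample supervises only its own format's head."""
    total, count = 0.0, 0
    for image, landmarks in batch:
        num_points = len(landmarks) // 2           # format inferred from the label
        pred = apply_head(heads[num_points], backbone(image))
        total += sum((p - t) ** 2 for p, t in zip(pred, landmarks))
        count += len(landmarks)
    return total / count


def export(heads, keep=51):
    """Keep a single head for deployment; the other heads are discarded."""
    return {keep: heads[keep]}


rng = random.Random(0)
heads = {n: make_head(n, rng) for n in FORMATS}
# One synthetic sample per format: a flat 64-pixel "image" and 2 * n coordinates.
batch = [([rng.random() for _ in range(64)],
          [rng.random() for _ in range(2 * n)]) for n in FORMATS]

loss = cross_format_loss(heads, batch)
deployed = export(heads)
```

In a real training loop the loss would be backpropagated through both the matching head and the shared backbone, which is how datasets in other formats could still improve the features used by the exported head.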

Paper Structure (14 sections, 1 equation, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Model overview of our proposed method. Our method consists of three major components: (a) an efficient backbone network that transforms each input image into a feature vector, (b) multiple facial landmark detection head networks that predict facial landmarks in various formats, and (c) a cross-format training strategy that supports training across different facial landmark formats. For deployment, only the 51-point component is included in the exported model.
  • Figure 2: Detailed architecture of each component: the proposed Efficient-OSA (EOSA) module, the decoder, and the facial landmark detection head network. Each module is lightweight due to the use of efficient operations such as feature concatenation layers and depthwise convolutional layers.
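To see why depthwise convolutions keep a module lightweight, a rough parameter count helps. The sketch below is generic arithmetic, not the EOSA layer configuration: the channel sizes and kernel size are hypothetical, biases are ignored, and pairing the depthwise convolution with a 1x1 pointwise convolution is an assumption (a common arrangement, not something the caption states).

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv plus a 1 x 1 pointwise conv (biases ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 128, 3

standard = standard_conv_params(c_in, c_out, k)         # 73728
separable = depthwise_separable_params(c_in, c_out, k)  # 8768
print(standard, separable)
```

For these example sizes the separable form needs roughly an eighth of the weights, which is the kind of saving that matters for model size and compute on embedded targets.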