GADS: A Super Lightweight Model for Head Pose Estimation

Menan Velayuthan, Asiri Gawesha, Purushoth Velayuthan, Nuwan Kodagoda, Dharshana Kasthurirathna, Pradeepa Samarasinghe

TL;DR

This work tackles efficient head pose estimation (HPE) for edge devices by introducing Grouped Attention Deep Sets (GADS), a landmark-based architecture that partitions facial landmarks into five regions and processes them with parallel Deep Set encoders. A multi-head attention module fuses inter-group information, yielding a compact vanilla model and a more capable hybrid (landmarks + RGB) variant. GADS reaches accuracy close to the state of the art while being 7.5x smaller and 25x faster than the lightest prior state-of-the-art model, and 4321x smaller than the best-performing one, which makes it well suited to edge devices. The models are evaluated on AFLW2000, BIWI, and 300W-LP. The approach is intended as a robust baseline for resource-constrained HPE, with potential extensions to other landmark-based downstream tasks; open-source code will be released.

Abstract

In human-computer interaction, head pose estimation profoundly influences application functionality. Although facial landmarks are valuable for this purpose, existing landmark-based methods prioritize precision over simplicity and model size, limiting their deployment on edge devices and in compute-poor environments. To bridge this gap, we propose Grouped Attention Deep Sets (GADS), a novel architecture based on the Deep Set framework. By grouping landmarks into regions and employing small Deep Set layers, we reduce computational complexity. Our multi-head attention mechanism extracts and combines inter-group information, resulting in a model that is $7.5\times$ smaller and executes $25\times$ faster than the current lightest state-of-the-art model. Notably, our model is also $4321\times$ smaller than the best-performing model. We introduce vanilla GADS and Hybrid-GADS (landmarks + RGB) and evaluate our models on three benchmark datasets: AFLW2000, BIWI, and 300W-LP. We envision our architecture as a robust baseline for resource-constrained head pose estimation methods.
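
As a rough illustration of the idea described in the abstract, the following dependency-free Python sketch embeds each landmark group with its own small Deep Set (a per-landmark map, sum pooling, then a second map), fuses the five group embeddings with multi-head self-attention, and maps the result to three angles. Only the group sizes come from the paper (the Figure 5 caption). The 2D landmark coordinates, the embedding width `DIM`, the head count `HEADS`, the Tanh activations, the random weights, and the flatten-plus-linear output head are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

# Group sizes are taken from the Figure 5 caption; everything else is assumed.
GROUP_SIZES = {"left_eye": 6, "right_eye": 6, "left_cheek": 5,
               "right_cheek": 5, "chin": 5}
DIM = 8     # hypothetical embedding width
HEADS = 2   # hypothetical number of attention heads

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def tanh(m):
    return [[math.tanh(v) for v in row] for row in m]

def deep_set(points, w_phi, w_rho):
    """rho(sum_i phi(x_i)): permutation-invariant embedding of one group."""
    embedded = tanh(matmul(points, w_phi))            # phi, applied per landmark
    pooled = [[sum(col) for col in zip(*embedded)]]   # sum pooling over the set
    return tanh(matmul(pooled, w_rho))[0]             # rho, applied to the pooled vector

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(tokens, w_q, w_k, w_v, heads):
    """Scaled dot-product self-attention over the group tokens, split into heads."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    d = len(tokens[0]) // heads
    out = [[] for _ in tokens]
    for h in range(heads):
        sl = slice(h * d, (h + 1) * d)
        qh, kh, vh = ([r[sl] for r in m] for m in (q, k, v))
        scores = matmul(qh, [list(c) for c in zip(*kh)])          # Q K^T
        weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
        for i, row in enumerate(matmul(weights, vh)):
            out[i].extend(row)                                    # concatenate heads
    return out

def gads_forward(landmarks, params):
    """Map 27 (x, y) landmarks to three untrained pose outputs."""
    tokens, start = [], 0
    for name, size in GROUP_SIZES.items():
        group = landmarks[start:start + size]
        tokens.append(deep_set(group, *params["groups"][name]))
        start += size
    fused = multi_head_attention(tokens, *params["attn"], HEADS)
    flat = [[v for row in fused for v in row]]
    return matmul(flat, params["head"])[0]   # final 3-unit linear layer

params = {
    "groups": {n: (rand_matrix(2, DIM), rand_matrix(DIM, DIM)) for n in GROUP_SIZES},
    "attn": tuple(rand_matrix(DIM, DIM) for _ in range(3)),
    "head": rand_matrix(len(GROUP_SIZES) * DIM, 3),
}
landmarks = [[rng.random(), rng.random()] for _ in range(sum(GROUP_SIZES.values()))]
angles = gads_forward(landmarks, params)
```

Because each group is sum-pooled, reordering the landmarks inside a group leaves `angles` unchanged up to floating-point rounding, which is the property the Deep Set framework is built around.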

Paper Structure

This paper contains 35 sections, 19 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Architectures of the GADS and the Deep Set Layer
  • Figure 2: GADS Hybrid Architecture: Landmarks are processed by the vanilla GADS model, while the input image undergoes a CNN block. The outputs are concatenated and fed through a final 3-unit Linear layer to obtain the final output.
  • Figure 3: The GADS Hybrid CNN consists of three convolution blocks with $5\times5\times16$ filters, followed by Tanh activation and $2\times2$ Average pooling. The final output is obtained by flattening the last convolution block's output and passing it through two pairs of Linear + Tanh layers.
  • Figure 4: Examples of 68 landmarks extracted using the FAN landmark detector on the AFLW2000 dataset.
  • Figure 5: Illustration of 27 selected landmarks grouped into five sections, each represented by a distinct color. Landmarks within the same color belong to the same group: Left eye (6 points) - Purple, Right eye (6 points) - Green, Left cheek (5 points) - Red, Right cheek (5 points) - Yellow, Chin (5 points) - Cyan.
  • ...and 5 more figures
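
The spatial sizes implied by the Figure 3 caption can be worked out by hand. The sketch below does so under several assumptions that the caption does not state: a square 64x64 input, stride-1 convolutions without padding, non-overlapping 2x2 pooling, and reading "$5\times5\times16$" as 16 filters of size 5x5. The function names are hypothetical and the numbers are illustrative only.

```python
def conv_out(size, kernel=5, stride=1, padding=0):
    """Output side length of a square convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2):
    """Output side length of non-overlapping average pooling."""
    return size // window

def hybrid_cnn_feature_size(input_size, blocks=3, channels=16):
    """Side length and flattened length after `blocks` conv + pool stages."""
    size = input_size
    for _ in range(blocks):
        size = pool_out(conv_out(size))
    return size, size * size * channels

# Assumed 64x64 input: 64 -> 60 -> 30 -> 26 -> 13 -> 9 -> 4
side, flat = hybrid_cnn_feature_size(64)
print(side, flat)  # prints "4 256"
```

Under these assumptions the flattened vector passed to the two Linear + Tanh pairs would have 256 entries; a different input size or padding choice changes this number.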