GADS: A Super Lightweight Model for Head Pose Estimation
Menan Velayuthan, Asiri Gawesha, Purushoth Velayuthan, Nuwan Kodagoda, Dharshana Kasthurirathna, Pradeepa Samarasinghe
TL;DR
This work addresses efficient head pose estimation (HPE) on edge devices by introducing Grouped Attention Deep Sets (GADS), a landmark-based architecture that partitions facial landmarks into five regions and encodes each region with its own small Deep Set encoder, all running in parallel. A multi-head attention module then fuses information across the groups. The design yields a compact vanilla model and a hybrid variant that combines landmarks with RGB input. GADS reaches accuracy comparable to the state of the art while being far smaller and faster than existing methods: 7.5x smaller and 25x faster than the lightest prior state-of-the-art model. The models are evaluated on AFLW2000, BIWI, and 300W-LP. The authors position the approach as a robust baseline for resource-constrained HPE, with potential extensions to other landmark-based downstream tasks; open-source code will be released.
Abstract
In human-computer interaction, head pose estimation strongly influences application functionality. Although facial landmarks are valuable for this purpose, existing landmark-based methods prioritize precision over simplicity and model size, which limits their deployment on edge devices and in compute-poor environments. To bridge this gap, we propose \textbf{Grouped Attention Deep Sets (GADS)}, a novel architecture based on the Deep Set framework. By grouping landmarks into regions and processing each region with small Deep Set layers, we reduce computational complexity. A multihead attention mechanism extracts and combines inter-group information, resulting in a model that is $7.5\times$ smaller and executes $25\times$ faster than the current lightest state-of-the-art model. It is also $4321\times$ smaller than the best-performing model. We introduce vanilla GADS and Hybrid-GADS (landmarks + RGB) and evaluate both on three benchmark datasets: AFLW2000, BIWI, and 300W-LP. We envision our architecture as a robust baseline for resource-constrained head pose estimation methods.
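
To make the grouped Deep Set plus attention idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a 68-point 2D landmark layout, a hypothetical five-way split by index range, an embedding width of 16, four attention heads, a residual connection around the attention step, and randomly initialized weights; none of these details come from the abstract. Each group is encoded as rho(sum of phi(x_i)), which makes the encoding independent of landmark order within a group, and the five group embeddings are then mixed by multi-head self-attention before a linear layer outputs three angles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical five-region split of a 68-point layout (an assumption).
GROUPS = {
    "jaw": range(0, 17),
    "brows": range(17, 27),
    "nose": range(27, 36),
    "eyes": range(36, 48),
    "mouth": range(48, 68),
}
D = 16      # per-group embedding width (illustrative choice)
HEADS = 4   # number of attention heads (illustrative choice)


def linear(n_in, n_out):
    """Random weight matrix and zero bias for one dense layer."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)


def init_params():
    params = {
        "phi": {g: linear(2, D) for g in GROUPS},   # per-point encoders
        "rho": {g: linear(D, D) for g in GROUPS},   # post-pooling encoders
    }
    for name in ("q", "k", "v", "o"):
        params[name] = linear(D, D)
    params["out"] = linear(len(GROUPS) * D, 3)      # yaw, pitch, roll
    return params


def deep_set(points, phi, rho):
    """Deep Set encoder: rho(sum_i phi(x_i)), invariant to point order."""
    h = np.maximum(points @ phi[0] + phi[1], 0.0)   # (n_points, D)
    pooled = h.sum(axis=0)                          # (D,)
    return np.maximum(pooled @ rho[0] + rho[1], 0.0)


def multihead_attention(tokens, params):
    """Scaled dot-product self-attention over the group embeddings."""
    g, d = tokens.shape
    dh = d // HEADS

    def project(name):
        w, b = params[name]
        # Reshape to (HEADS, groups, dh) so each head attends separately.
        return (tokens @ w + b).reshape(g, HEADS, dh).transpose(1, 0, 2)

    q, k, v = project("q"), project("k"), project("v")
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)   # (HEADS, g, g)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    mixed = (weights @ v).transpose(1, 0, 2).reshape(g, d)
    w_o, b_o = params["o"]
    return mixed @ w_o + b_o


def forward(landmarks, params):
    """Map (68, 2) landmarks to three pose angles."""
    tokens = np.stack([
        deep_set(landmarks[list(idx)], params["phi"][g], params["rho"][g])
        for g, idx in GROUPS.items()
    ])                                                # (5, D)
    fused = tokens + multihead_attention(tokens, params)  # residual (assumed)
    w, b = params["out"]
    return fused.reshape(-1) @ w + b


def count_params(p):
    """Total number of scalar weights in the nested parameter structure."""
    if isinstance(p, dict):
        return sum(count_params(v) for v in p.values())
    if isinstance(p, tuple):
        return sum(a.size for a in p)
    return 0
```

Calling `forward(landmarks, init_params())` on a `(68, 2)` array returns a length-3 vector, and `count_params` reports the size of this toy configuration. Because each group is sum-pooled, shuffling the landmarks inside one region leaves the output unchanged, which is the property that lets the per-group encoders stay small.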
