
Is Geometry Enough? An Evaluation of Landmark-Based Gaze Estimation

Daniele Agostinelli, Thomas Agostinelli, Andrea Generosi, Maura Mengoni

Abstract

Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate but computationally expensive, and they act as "black boxes" that offer little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored on modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models: Extreme Gradient Boosting (XGBoost) trees and two neural architectures, a holistic Multi-Layer Perceptron (MLP) and a Siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at https://github.com/daniele-agostinelli/LandmarkGaze.git.


Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Pipeline for extracting the landmark-based datasets. Images are processed to extract landmarks and head pose, which are then used to normalize the data into a virtual camera space.
  • Figure 2: Example image from the GazeGene dataset Bao2025: (a) original and (b) normalized images. In (b), the red arrow indicates the normalized gaze direction, cyan lines mark the image principal axes, and green circles denote the $N=20$ landmarks extracted from the MediaPipe face mesh Lugaresi2019 to represent gaze and head orientation: the right iris (indices 473--477) and eye contour (263, 362, 374, 386), the left iris (468--472) and eye contour (33, 133, 145, 159), and two head anchors given by the nose tip (1) and the glabella (9).
  • Figure 3: Architectural comparison of gaze estimation models. (a) Holistic Multi-Layer Perceptron (MLP): facial landmarks are processed through an input projection layer, a stack of $K$ residual blocks and a final regression head. (b) Siamese MLP: facial landmarks are split into local eye regions, processed by two independent encoders, and fused with geometric context (relative eye positions $\Delta\mathbf{c}$ and head anchors $\mathbf{f}_H$) via a fusion MLP. (c) XGBoost: a gradient-boosted tree approach using a multi-output regressor for the gaze vector components, $\mathbf{g}'= \left({g'_x,g'_y,g'_z}\right)$, based on global landmark features, $\mathbf{f}_G$. (d) Residual block used in (a) and (b), featuring a residual connection around two sets of Linear, BatchNorm, GELU, and Dropout layers.
  • Figure 4: Distributions of yaw and pitch angles (in degrees) for gaze (first row) and head pose (second row) across the three datasets: Gaze360 (a), GazeGene (b), and ETH-XGaze (c). Columns (left to right) show the distributions of the original, extracted, and excluded samples. Colors represent density values according to the reported colorbars.
  • Figure 5: Examples of evident errors in face-landmark detection from different datasets (GazeGene Bao2025, Gaze360 Kellnhofer2019, ETH-XGaze Zhang2020). Green circles denote the $N=20$ landmarks extracted using MediaPipe Lugaresi2019. These errors are likely due to low image resolution, face cropping, or extreme poses.
  • ...and 1 more figure
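The landmark subset described in the Figure 2 caption can be collected into plain index lists. The sketch below mirrors that grouping; the helper name `select_landmarks` is ours, and it assumes the full MediaPipe face mesh (478 points with iris refinement) has already been detected:

```python
# MediaPipe face-mesh indices for the N = 20 landmarks listed in Figure 2.
RIGHT_IRIS = [473, 474, 475, 476, 477]
RIGHT_EYE_CONTOUR = [263, 362, 374, 386]
LEFT_IRIS = [468, 469, 470, 471, 472]
LEFT_EYE_CONTOUR = [33, 133, 145, 159]
HEAD_ANCHORS = [1, 9]  # nose tip and glabella

GAZE_LANDMARKS = (RIGHT_IRIS + RIGHT_EYE_CONTOUR
                  + LEFT_IRIS + LEFT_EYE_CONTOUR + HEAD_ANCHORS)

def select_landmarks(face_mesh_points):
    """Pick the 20 gaze-relevant points from a full face mesh.

    `face_mesh_points` is a sequence of (x, y, z) tuples indexed by
    MediaPipe landmark id.
    """
    return [face_mesh_points[i] for i in GAZE_LANDMARKS]
```

The selected points are what the regression models in Figure 3 consume after normalization.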
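Gaze labels reported as yaw/pitch pairs (as in Figure 4) are commonly converted to 3D unit vectors for regression. One widely used convention is sketched below; the sign conventions differ across datasets, so treat the exact signs here as an assumption rather than the paper's definition:

```python
import math

def yawpitch_to_vector(yaw_deg, pitch_deg):
    """Convert gaze yaw/pitch (degrees) to a 3D unit vector.

    Assumed convention (varies by dataset): yaw rotates left/right
    about the vertical axis, pitch rotates up/down, and (0, 0)
    maps to (0, 0, -1), i.e. looking straight at the camera.
    """
    yaw = math.radians(yaw_deg)
    pitch = math.radians(pitch_deg)
    x = -math.cos(pitch) * math.sin(yaw)
    y = -math.sin(pitch)
    z = -math.cos(pitch) * math.cos(yaw)
    return (x, y, z)
```

The resulting vector always has unit norm, since $\cos^2 p\,\sin^2 y + \sin^2 p + \cos^2 p\,\cos^2 y = 1$, which is what makes the angular-error metric between predicted and ground-truth gaze well defined.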