EM-Net: Gaze Estimation with Expectation Maximization Algorithm

Zhang Cheng, Yanxia Wang, Guoyu Xia

TL;DR

EM-Net targets accurate gaze estimation under resource constraints by combining a Global Attention Mechanism (GAM) with an Expectation Maximization (EM)-based refinement module on a lightweight MobileNetV3 backbone. The design enlarges the receptive field and improves robustness to incomplete data, enabling strong full-face gaze estimation with only 50% of the training data. Results on MPIIFaceGaze, Gaze360, and RT-Gene show small but consistent reductions in angular error and large reductions in parameters and FLOPs, with only a modest impact on inference time from the EM iterations. These findings suggest a practical approach to real-time, low-power gaze estimation in noisy or occluded conditions.
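For readers unfamiliar with the metric, angular error in gaze estimation is commonly computed as the angle between the predicted and ground-truth 3D gaze directions. The sketch below illustrates that standard definition only; the function name `angular_error_deg` and the example vectors are assumptions for illustration and are not taken from the paper.

```python
import math


def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors.

    Illustrative helper (hypothetical name); vectors need not be unit length.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0.0 or norm_t == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_angle))


# Made-up example vectors: a prediction slightly off a straight-ahead gaze.
error = angular_error_deg((0.05, 0.0, -1.0), (0.0, 0.0, -1.0))
```

A lower value means the predicted direction is closer to the ground truth.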

Abstract

In recent years, the accuracy of gaze estimation techniques has gradually improved, but existing methods often rely on large datasets or large models to improve performance, which places high demands on computational resources. To address this issue, this paper proposes EM-Net, a lightweight gaze estimation model that combines deep learning with a traditional machine learning algorithm, the Expectation Maximization (EM) algorithm. First, the proposed Global Attention Mechanism (GAM) is added to extract features related to gaze estimation, which improves the model's ability to capture global dependencies and thus its performance. Second, by learning hierarchical feature representations through the EM module, the model gains strong generalization ability, which reduces the number of training samples it needs. Experiments confirm that, using only 50% of the training data, EM-Net improves performance on the Gaze360, MPIIFaceGaze, and RT-Gene datasets by 2.2%, 2.02%, and 2.03%, respectively, compared with GazeNAS-ETH. It also shows good robustness to Gaussian noise interference.
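As background on the Expectation Maximization algorithm named in the abstract, the following is a minimal, generic sketch of EM fitting a two-component 1D Gaussian mixture. It is not the paper's EM module, whose details are not given in this section; the function names, the initialization, and the sample data are all assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, n_iter=20, min_var=1e-6):
    """Generic EM for a two-component 1D Gaussian mixture (illustrative only)."""
    # Assumed initialization: means at the data extremes, unit variances.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from the weighted samples.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, min_var)  # floor avoids a zero variance
            weights[k] = nk / len(data)
    return weights, means, variances


# Made-up samples forming two well-separated clusters.
samples = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
weights, means, variances = em_gmm_1d(samples)
```

The E-step assigns soft memberships and the M-step updates the parameters from them; alternating the two is the iteration pattern that any EM-based module builds on, whatever the specific model being fitted.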

Paper Structure

This paper contains 20 sections, 7 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: EM-Net.
  • Figure 2: Bneck network structure. NL denotes the activation function, which differs between layers; see Table 1 for details. Dwise denotes the depthwise (grouped) convolution. When the number of input channels equals exp_size, the 1×1 convolution that expands the dimension at the input is omitted.
  • Figure 3: SE and GAM.
  • Figure 4: Schematic diagram of GAM information exchange.
  • Figure 5: EM Module.
  • ...and 3 more figures