Modeling State Shifting via Local-Global Distillation for Event-Frame Gaze Tracking

Jiading Li, Zhiyu Zhu, Jinhui Hou, Junhui Hou, Jinjian Wu

TL;DR

The paper addresses passive gaze estimation by fusing event-based and frame-based data. It reframes gaze as a state shift to registered anchor states and proposes a two-stage coarse-to-fine framework with local expert networks, followed by a local-global latent denoising diffusion distillation into a global student. Through hard and soft distillation losses and a diffusion-based denoising strategy, the method achieves substantial gains over state-of-the-art, with near-eye MAE around 1.93° and high accuracy on a hybrid event-frame dataset. The approach demonstrates improved robustness to event noise and cross-modal fusion, offering a practical, high-temporal-resolution solution for real-time gaze tracking.

Abstract

This paper tackles the problem of passive gaze estimation using both event and frame data. Considering the inherently different physiological structures, it is intractable to accurately estimate gaze purely based on a given state. Thus, we reformulate gaze estimation as the quantification of the state shifting from the current state to several prior registered anchor states. Specifically, we propose a two-stage learning-based gaze estimation framework that divides the whole gaze estimation process into a coarse-to-fine approach involving anchor state selection and final gaze location. Moreover, to improve the generalization ability, instead of learning a large gaze estimation network directly, we align a group of local experts with a student network, where a novel denoising distillation algorithm is introduced to utilize denoising diffusion techniques to iteratively remove inherent noise in event data. Extensive experiments demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods by a large margin of 15$\%$. The code will be publicly available at https://github.com/jdjdli/Denoise_distill_EF_gazetracker.
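
The alignment of local experts with a student network can be illustrated with a combined hard and soft distillation objective. The following is a minimal sketch, not the authors' implementation: the function name `distillation_loss`, the L1 form of both terms, and the weight `alpha` are assumptions made for illustration. The hard term compares the student with the ground-truth gaze point, and the soft term compares the student with the prediction of a local expert.

```python
import numpy as np

def distillation_loss(student_pred, expert_pred, target, alpha=0.5):
    """Hypothetical hard + soft distillation objective (illustrative only).

    student_pred, expert_pred, target: arrays of shape (batch, 2) holding
    2-D gaze points. alpha is an assumed mixing weight, not from the paper.
    """
    student_pred = np.asarray(student_pred, dtype=float)
    expert_pred = np.asarray(expert_pred, dtype=float)
    target = np.asarray(target, dtype=float)

    # Hard term: student vs. ground-truth gaze point (L1 assumed here).
    hard = np.mean(np.abs(student_pred - target))
    # Soft term: student vs. local-expert prediction (L1 assumed here).
    soft = np.mean(np.abs(student_pred - expert_pred))

    total = alpha * hard + (1.0 - alpha) * soft
    return total, hard, soft
```

Under this sketch, `alpha = 1.0` reduces to plain supervised training, while smaller values make the student follow the local experts more closely.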

Paper Structure (12 sections, 11 equations, 10 figures, 4 tables, 2 algorithms)


Figures (10)

  • Figure 1: Left: Overview of our gaze estimation setup: our framework emphasizes the modeling of gaze shifts from a registered anchor state to the currently acquired state captured during actual use. Our approach takes input in the form of a frame coupled with corresponding event data to infer the position of the directional gaze point as the output. Right: Beyond the confines of static frame-based gaze estimation, studying dynamic ocular movements constitutes an additional research trajectory within computer vision.
  • Figure 2: Workflow of the proposed framework, where black arrows (resp. pink arrows) represent the training (resp. testing) pipeline. The First Stage (Sec. \ref{Sec:Gaze_struct}): State Correlation Modeling by Local Expert. We first partition the entire gaze-point region into several sub-regions, and the data of each sub-region is used to train a local expert network. Each expert network is fed simultaneously with the anchor state and a search state, and uses transformers to explicitly model the correlation between the anchor and search states. The Second Stage (Sec. \ref{Sec:Distillation}): Local-Global Latent Denoising Distillation. A latent denoising knowledge distillation method is introduced to merge the expertise of the local expert networks into a single, comprehensive student network. Note that latent denoising and knowledge distillation are used in the training phase only (see details in Sec. \ref{sec:experiment}). Anchor selection, shown in the light pink box, is illustrated in detail in Fig. \ref{fig:anchorselection}.
  • Figure 3: Gaze estimation accuracy when training on perceived regions of different sizes, denoted as $n\times n$. All models are evaluated on data whose perceived regions match those of their respective training sets. The results indicate that enlarging the training region leads to a substantial degradation in network performance. The increase in network parameters is needed to fit the dataset (otherwise, the network struggles to converge). Meanwhile, as shown in the rightmost example, directly training with multiple anchors in the 11$\times$11 region also fails to converge to an accurate result. This observation suggests that, instead of training directly on the whole region, we can distil the small but accurate local-region models into a large student network for accurate modelling of gaze motion. $\uparrow$ (resp. $\downarrow$) indicates that larger (resp. smaller) values are better.
  • Figure 4: (a) shows the result of training a single model on a large region: it exhibits pronounced over-fitting, with the heatmap showing attention dispersed away from the ocular region. (b) shows our model, which distils knowledge from a set of local experts; its heatmap is distinctly concentrated on the ocular region. The visualization indicates that, through the proposed local-global distillation, the network attends accurately to the relevant region.
  • Figure 5: Impact of the number of registered anchor states on prediction accuracy. The results show that increasing the number of anchor states generally improves prediction accuracy, with the best performance achieved when using 5 anchor states. Note that prediction accuracy is plotted on both vertical axes.
  • ...and 5 more figures
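
The first stage described in the Figure 2 caption partitions the gaze-point region into sub-regions, each served by its own local expert. The sketch below shows one simple way such a partition could route samples to experts. It is an assumption for illustration only: the function name `assign_to_experts`, the square tiling, and the sub-region size `sub_size` are hypothetical, and only the 11$\times$11 region size comes from the Figure 3 caption.

```python
import numpy as np

def assign_to_experts(gaze_xy, grid_size=11, sub_size=4):
    """Map integer gaze-grid coordinates to a local-expert index.

    gaze_xy: array-like of shape (n, 2) with (x, y) in [0, grid_size).
    The grid is tiled with sub_size x sub_size blocks (hypothetical layout);
    blocks on the right and bottom edges may be smaller.
    """
    gaze_xy = np.asarray(gaze_xy, dtype=int)
    if gaze_xy.min() < 0 or gaze_xy.max() >= grid_size:
        raise ValueError("gaze coordinates fall outside the grid")

    # Number of sub-regions along each axis (ceiling division).
    per_axis = -(-grid_size // sub_size)

    col = gaze_xy[:, 0] // sub_size
    row = gaze_xy[:, 1] // sub_size
    # Row-major expert index in [0, per_axis * per_axis).
    return row * per_axis + col
```

With the default values, the 11$\times$11 grid is covered by 3 blocks per axis, so samples are routed to one of 9 hypothetical experts. Each expert would then be trained only on the samples assigned to its index.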