Realtime Dynamic Gaze Target Tracking and Depth-Level Estimation
Esmaeil Seraj, Harsh Bhate, Walter Talamonti
TL;DR
The paper tackles real-time gaze tracking on dynamic transparent displays (TDs) with a dual-module system: a dynamic Quadtree-based gaze target tracker that identifies the projected widget under gaze, and a lightweight multi-stream self-attention model that estimates gaze depth to prevent undesired interactions. The system is trained and evaluated on real eye-tracking data, with extensive ablations and on-device inference on an SoC (TI TDA4VM) demonstrating real-time feasibility and high accuracy. Key contributions include a robust, scalable Quadtree algorithm for moving and overlapping 2D widgets, and a depth-estimation network that uses intra- and inter-stream attention to model physical dependencies in eye-tracking data. The work advances TD-based AR/HUD interaction by enabling precise, depth-aware gaze monitoring suitable for automotive applications and real-time embedded deployment, with potential for improved safety and user experience.
Abstract
The integration of Transparent Displays (TDs) in various applications, such as Heads-Up Displays (HUDs) in vehicles, is a burgeoning field, poised to revolutionize user experiences. However, this innovation brings forth significant challenges in realtime human-device interaction, particularly in accurately identifying and tracking a user's gaze on dynamically changing TDs. In this paper, we present a robust and efficient two-part solution for realtime gaze monitoring, comprising: (1) a tree-based algorithm for identifying and dynamically tracking gaze targets (i.e., moving, size-changing, and overlapping 2D content) projected on a transparent display in realtime; and (2) a multi-stream self-attention architecture that estimates the depth-level of human gaze from eye-tracking data, accounting for the display's transparency and preventing undesired interactions with the TD. We collected a real-world eye-tracking dataset to train and test our gaze monitoring system. We present extensive results and ablation studies, including inference experiments on System on Chip (SoC) evaluation boards, demonstrating our model's scalability, precision, and realtime feasibility in both static and dynamic contexts. Our solution marks a significant stride in enhancing next-generation user-device interaction and experience, setting a new benchmark for algorithmic gaze monitoring technology in dynamic transparent displays.
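To make the tree-based gaze target lookup concrete, the following is a minimal sketch of a point query over rectangular widgets stored in a quadtree. It is not the authors' algorithm: the names `Rect`, `Quadtree`, and `gaze_target`, the 1920x720 display size, the node capacity, rebuilding the tree on every frame to handle moving or resizing widgets, and resolving overlaps by a z-order value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in display pixels (hypothetical widget bounds)."""
    x: float
    y: float
    w: float
    h: float

    def contains(self, px, py):
        # Half-open interval so a point belongs to exactly one quadrant.
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h

    def intersects(self, other):
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.h and other.y < self.y + self.h)


class Quadtree:
    """Region quadtree; a widget is stored in every leaf it overlaps."""

    def __init__(self, bounds, capacity=2, max_depth=6, depth=0):
        self.bounds = bounds
        self.capacity = capacity
        self.max_depth = max_depth
        self.depth = depth
        self.items = []      # (widget_id, rect, z) tuples held by a leaf
        self.children = []   # empty for a leaf, four nodes otherwise

    def insert(self, widget_id, rect, z=0):
        if not self.bounds.intersects(rect):
            return
        if self.children:
            for child in self.children:
                child.insert(widget_id, rect, z)
            return
        self.items.append((widget_id, rect, z))
        if len(self.items) > self.capacity and self.depth < self.max_depth:
            self._subdivide()

    def _subdivide(self):
        b, hw, hh = self.bounds, self.bounds.w / 2, self.bounds.h / 2
        self.children = [
            Quadtree(Rect(b.x + dx * hw, b.y + dy * hh, hw, hh),
                     self.capacity, self.max_depth, self.depth + 1)
            for dy in (0, 1) for dx in (0, 1)
        ]
        items, self.items = self.items, []
        for wid, rect, z in items:
            for child in self.children:
                child.insert(wid, rect, z)

    def query(self, px, py):
        """Return (widget_id, z) for every widget containing the point."""
        if not self.bounds.contains(px, py):
            return []
        for child in self.children:
            if child.bounds.contains(px, py):
                return child.query(px, py)
        return [(wid, z) for wid, rect, z in self.items if rect.contains(px, py)]


def gaze_target(widgets, gaze, display=Rect(0, 0, 1920, 720)):
    """Rebuild the tree for the current frame and return the top-most hit."""
    tree = Quadtree(display)
    for wid, (rect, z) in widgets.items():
        tree.insert(wid, rect, z)
    hits = tree.query(*gaze)
    return max(hits, key=lambda hit: hit[1])[0] if hits else None


# Example frame: "nav" overlaps "speed" and has the higher z-order.
widgets = {
    "speed": (Rect(100, 100, 200, 100), 0),
    "nav": (Rect(150, 120, 300, 200), 1),
    "media": (Rect(1000, 400, 200, 100), 0),
}
print(gaze_target(widgets, (160, 130)))
```

Rebuilding per frame keeps the sketch short; a tracker that must scale to many widgets would more likely update only the nodes touched by widgets that moved or changed size.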
