Realtime Dynamic Gaze Target Tracking and Depth-Level Estimation

Esmaeil Seraj; Harsh Bhate; Walter Talamonti

TL;DR

The paper tackles real-time gaze monitoring on dynamic transparent displays (TDs) with a two-module system: a dynamic Quadtree-based gaze target tracker that identifies the projected widget under gaze, and a lightweight multi-stream self-attention model that estimates gaze depth to prevent undesired interactions. The system is trained and evaluated on real eye-tracking data, with extensive ablations and inference experiments on an SoC (TI TDA4VM) demonstrating real-time feasibility and high accuracy. Key contributions include a robust, scalable Quadtree algorithm for moving and overlapping 2D widgets and a depth-estimation network that uses intra- and inter-stream attention to model physical dependencies in eye-tracking data. The work advances TD-based AR/HUD interaction by enabling precise, depth-aware gaze monitoring suitable for automotive applications and real-time embedded deployment, with potential for improved safety and user experience.

Abstract

The integration of Transparent Displays (TD) in various applications, such as Heads-Up Displays (HUDs) in vehicles, is a burgeoning field, poised to revolutionize user experiences. However, this innovation brings forth significant challenges in realtime human-device interaction, particularly in accurately identifying and tracking a user's gaze on dynamically changing TDs. In this paper, we present a robust and efficient two-fold solution for realtime gaze monitoring, comprising: (1) a tree-based algorithm for identifying and dynamically tracking gaze targets (i.e., moving, size-changing, and overlapping 2D content) projected on a transparent display, in realtime; (2) a multi-stream self-attention architecture to estimate the depth-level of human gaze from eye-tracking data, to account for the display's transparency and prevent undesired interactions with the TD. We collected a real-world eye-tracking dataset to train and test our gaze monitoring system. We present extensive results and ablation studies, including inference experiments on System on Chip (SoC) evaluation boards, demonstrating our model's scalability, precision, and realtime feasibility in both static and dynamic contexts. Our solution marks a significant stride in enhancing next-generation user-device interaction and experience, setting a new benchmark for algorithmic gaze monitoring technology in dynamic transparent displays.
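
To make item (1) of the abstract concrete, the following is a minimal sketch of a quadtree that maps a 2D gaze point to the widget under it. It is not the authors' implementation: the names `Node`, `insert`, `query`, and `build`, the `MAX_WIDGETS` and `MAX_DEPTH` limits, the z-order tie-break for overlapping widgets, and the choice to rebuild the tree whenever widgets change are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]   # (x, y, width, height)
Widget = Tuple[str, Rect, int]             # (id, bounding box, z-order); hypothetical tuple layout

MAX_WIDGETS, MAX_DEPTH = 2, 4              # hypothetical split thresholds


def contains(r: Rect, px: float, py: float) -> bool:
    x, y, w, h = r
    return x <= px < x + w and y <= py < y + h


def overlaps(a: Rect, b: Rect) -> bool:
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


@dataclass
class Node:
    bounds: Rect
    depth: int = 0
    widgets: List[Widget] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def insert(node: Node, widget: Widget) -> None:
    """Store the widget in every leaf whose region it overlaps."""
    if not overlaps(node.bounds, widget[1]):
        return
    if node.children:
        for child in node.children:
            insert(child, widget)
        return
    node.widgets.append(widget)
    if len(node.widgets) > MAX_WIDGETS and node.depth < MAX_DEPTH:
        # Split the leaf into four quadrants and redistribute its widgets.
        x, y, w, h = node.bounds
        hw, hh = w / 2, h / 2
        node.children = [Node((x + dx, y + dy, hw, hh), node.depth + 1)
                         for dx in (0, hw) for dy in (0, hh)]
        pending, node.widgets = node.widgets, []
        for wd in pending:
            for child in node.children:
                insert(child, wd)


def query(node: Node, px: float, py: float) -> Optional[str]:
    """Descend depth-first to the leaf holding the gaze point, then pick the top widget."""
    if not contains(node.bounds, px, py):
        return None
    for child in node.children:
        if contains(child.bounds, px, py):
            return query(child, px, py)
    hits = [w for w in node.widgets if contains(w[1], px, py)]
    return max(hits, key=lambda w: w[2])[0] if hits else None


def build(bounds: Rect, widgets: List[Widget]) -> Node:
    """Assumption: rebuild from the latest widget tuples on every update."""
    root = Node(bounds)
    for w in widgets:
        insert(root, w)
    return root


display = (0.0, 0.0, 1920.0, 720.0)
widgets = [("speed", (100, 100, 200, 100), 0), ("nav", (250, 150, 300, 200), 1),
           ("media", (1200, 400, 200, 150), 0), ("alert", (1500, 50, 100, 100), 2)]
tree = build(display, widgets)
print(query(tree, 150, 120), query(tree, 280, 180), query(tree, 10, 10))

# A moving widget: "media" is relocated and the tree is rebuilt.
widgets[2] = ("media", (0, 0, 200, 150), 0)
moved = build(display, widgets)
print(query(moved, 10, 10))
```

Rebuilding on every update keeps the sketch short; a production tracker would more plausibly update only the affected leaves, which is the kind of dynamic update the paper's section titles suggest.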

Paper Structure (19 sections, 1 equation, 6 figures, 1 table, 2 algorithms)

Introduction
Related Work
Preliminaries
Problem Formulation
Quadtree Data Structures
Self-Attention Mechanism
Method
High-level Overview
Dynamic Quadtree for Gaze Target Tracking
Quadtree Initialization and Widget Insertion
Widget Identification and Dynamic Updates
Multi-Stream Attention Model for Gaze Depth-Level Estimation from Eye-Tracking Data
Gaze Depth Model Architecture
Evaluation
Eye Tracking Data Collection
...and 4 more sections

Figures (6)

Figure 1: An example of a Transparent Display (TD), showing a large-scale dynamic Heads-Up Display (HUD) in a vehicle. A systematic solution is required to identify and smoothly track human gaze for downstream AR actions on projected HUD content, in realtime (image credit to DALL.E).
Figure 2: High-level overview of our gaze monitoring system. The gaze target tracking and gaze depth-level estimation modules work in parallel to robustly determine the focus of human gaze on the projected 2D widgets on the TD. Our system receives realtime eye-tracking data obtained from existing technology.
Figure 3: The architecture of the proposed Quadtree-based realtime dynamic gaze target tracking algorithm (i.e., the first module in our gaze monitoring system shown in Fig. 2) for transparent displays (e.g., here a dynamic HUD). At any given time, the algorithm receives the stream of the gaze target intersection points on the AR display, given by the eye-tracking camera, as well as the stream of widget information tuples. The tree structure is updated in realtime based on the received set of widget information and an efficient tree-traversal (i.e., Depth-First Search (DFS)) is performed to associate a gaze target intersection point to a leaf-node in the constructed tree.
Figure 4: The proposed multi-stream attention model for gaze depth-level estimation from composite multi-dimensional eye-tracking data. To capture the physical dependencies within and between different portions of the eye-tracking data, we leverage intra- and inter-stream self-attention layers. The intra-stream attention layers capture the physical relations between the elements of each data section (i.e., connection between left and right eye or their internal X, Y, and Z values), while the inter-stream attention captures the relation between eye position, eye rotation (gaze vectors), and 2D gaze target intersection points. The output is a probability over discrete gaze depth levels.
Figure 5: Scalability analysis for our tree-based target tracking algorithm. Results show the system's input-output latency (mean ± STD) in the three experimental scenarios described in the Results and Discussion section, over a range of numbers of virtual widgets projected on a simulated transparent display. As shown, latency scales approximately linearly with the number of widgets, indicating a realtime-feasible implementation.
...and 1 more figures
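
The intra- and inter-stream attention described in the Figure 4 caption can be illustrated with a small, untrained sketch. It is only a schematic of the data flow under stated assumptions, not the paper's architecture: the projections are identity matrices (a trained model would learn them), streams are summarized by mean pooling, the 2D gaze point is zero-padded to three dimensions, and the function names, the three depth levels, and the `head_weights` values are all hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity Q/K/V projections (assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def mean_pool(tokens: List[Vector]) -> Vector:
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def depth_probabilities(eye_pos: List[Vector], gaze_dir: List[Vector],
                        gaze_point: List[Vector], head_weights: List[Vector]) -> Vector:
    streams = [eye_pos, gaze_dir, gaze_point]
    # Intra-stream attention: relate the elements inside each stream (e.g. left vs. right eye).
    summaries = [mean_pool(self_attention(s)) for s in streams]
    # Inter-stream attention: relate eye position, gaze vectors, and gaze target points.
    fused = mean_pool(self_attention(summaries))
    # Linear head followed by softmax gives a probability per discrete depth level.
    logits = [sum(w * f for w, f in zip(row, fused)) for row in head_weights]
    return softmax(logits)


# Made-up sample: left/right eye positions, left/right gaze vectors, one padded 2D gaze point.
eye_pos = [[-0.03, 0.0, 0.6], [0.03, 0.0, 0.6]]
gaze_dir = [[0.05, -0.02, -1.0], [-0.05, -0.02, -1.0]]
gaze_point = [[0.4, 0.3, 0.0]]
head_weights = [[0.2, -0.1, 0.5], [0.0, 0.3, -0.4], [-0.3, 0.1, 0.1]]  # hypothetical, untrained

probs = depth_probabilities(eye_pos, gaze_dir, gaze_point, head_weights)
print(probs)
```

With random or hand-picked weights the output carries no meaning beyond being a valid distribution over depth levels; the point is the ordering of the two attention stages.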
