A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos

Suleyman Ozdel, Yao Rong, Berat Mert Albaba, Yen-Ling Kuo, Xi Wang, Enkelejda Kasneci

TL;DR

This work tackles third-person gaze prediction to enable gaze-aware video understanding without real-time human input. It introduces a transformer-based reinforcement learning framework (Decision Transformer) that conditions gaze action generation on the state $s_t$, the current gaze $\mathbf{p}_t$, and the return-to-go $R_t$. The model produces the next gaze coordinates $\hat{\mathbf{p}}_{t+1}=Q(\tau_{\{t-L:t\}})$ from the last $L$ steps of the trajectory $\tau$, while maximizing $\mathbb{E}[\sum_t r_t]$ and minimizing $\mathcal{L}_{MSE}=\frac{1}{N}\sum_t (\mathbf{p}_t-\hat{\mathbf{p}}_t)^2$. It uses a ResNet-50 visual backbone and a masked gaze region to focus on fixation content, and is trained on eye-tracking data collected on videos generated by the VirtualHome simulator, with a focus on activity recognition. Experimental results show that the RL gaze predictor outperforms baselines and provides useful signals for downstream gaze-guided tasks. It achieves competitive performance when real gaze data are unavailable, reducing reliance on human input for video analysis, and each predicted $\hat{\mathbf{p}}_{t+1}$ guides subsequent gaze decisions.
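The quantities named in the TL;DR can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the return-to-go $R_t$ is taken to be the undiscounted sum of rewards from step $t$ onward (the usual Decision Transformer convention), the squared term in $\mathcal{L}_{MSE}$ is read as the squared Euclidean distance between 2D gaze points, and the window $\tau_{\{t-L:t\}}$ is treated as inclusive of step $t$. The function names `returns_to_go`, `gaze_mse`, and `context_window` are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

Point = Tuple[float, float]  # (x, y) gaze coordinates; units are an assumption
T = TypeVar("T")


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Assumed definition: R_t = r_t + r_{t+1} + ... (undiscounted)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each R_t is a suffix sum.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def gaze_mse(targets: Sequence[Point], preds: Sequence[Point]) -> float:
    """Mean over N steps of the squared distance between p_t and its prediction."""
    if len(targets) != len(preds) or len(targets) == 0:
        raise ValueError("targets and preds must be non-empty and equally long")
    total = 0.0
    for (tx, ty), (px, py) in zip(targets, preds):
        total += (tx - px) ** 2 + (ty - py) ** 2
    return total / len(targets)


def context_window(trajectory: Sequence[T], t: int, L: int) -> List[T]:
    """Slice the trajectory to steps t-L..t (inclusive), clipped at step 0."""
    start = max(0, t - L)
    return list(trajectory[start : t + 1])
```

In a Decision Transformer setup, each trajectory element would bundle the return-to-go, the state features, and the gaze point for one step; the model would then map `context_window(trajectory, t, L)` to the next gaze point, and `gaze_mse` would score those predictions against recorded fixations.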

Abstract

Eye-tracking applications that utilize the human gaze in video understanding tasks have become increasingly important. To effectively automate the process of video analysis based on eye-tracking data, it is important to accurately replicate human gaze behavior. However, this task presents significant challenges due to the inherent complexity and ambiguity of human gaze patterns. In this work, we introduce a novel method for simulating human gaze behavior. Our approach uses a transformer-based reinforcement learning algorithm to train an agent that acts as a human observer, with the primary role of watching videos and simulating human gaze behavior. We use an eye-tracking dataset gathered from videos generated by the VirtualHome simulator, with a primary focus on activity recognition. Our experimental results demonstrate the effectiveness of our gaze prediction method by highlighting its capability to replicate human gaze behavior and its applicability to downstream tasks where real human gaze is used as input.

Paper Structure (14 sections, 3 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Sample frames from the eye-tracking dataset generated with the VirtualHome platform.
  • Figure 2: Trajectory comparison of ground-truth fixation (gray) and our prediction (green). Each dot represents a fixation point.