EyEar: Learning Audio Synchronized Human Gaze Trajectory Based on Physics-Informed Dynamics

Xiaochuan Liu, Xin Cheng, Yuchong Sun, Xiaoxue Wu, Ruihua Song, Hao Sun, Denghao Zhang

TL;DR

This work tackles the prediction of human gaze trajectories in visual scenes with synchronized audio. It introduces EyEar, a physics-informed dynamical system that combines intrinsic eye motion, saliency attraction, and audio semantic attraction. The framework has three modules: a dynamical system with learned forces, a multimodal attention-based predictor of audio-semantic attraction points, and a probability density loss based on Gaussian mixtures that handles high inter-subject variability. The authors contribute an eye-tracking dataset of 20k gaze points collected from 8 subjects, a model that combines image, text, and audio streams, and a distribution-based evaluation metric intended to better capture human-like gaze behavior. EyEar achieves state-of-the-art performance across multiple metrics, improving over baselines by 4–15%, and produces more realistic gaze dynamics, with implications for more lifelike virtual characters and multimodal scene understanding.

Abstract

Imitating how humans move their gaze in a visual scene is a vital research problem for both visual understanding and psychology, enabling crucial applications such as building lifelike virtual characters. Previous studies aim to predict gaze trajectories when humans are free-viewing an image, searching for required targets, or looking for clues to answer questions about an image. While these tasks focus on vision-centric scenarios, in more common scenarios humans also move their gaze in response to audio signals. To fill this gap, we introduce a new task that predicts human gaze trajectories in a visual scene with synchronized audio inputs and provide a new dataset containing 20k gaze points from 8 subjects. To effectively integrate audio information and simulate the dynamic process of human gaze motion, we propose a novel learning framework called EyEar (Eye moving while Ear listening) based on physics-informed dynamics, which considers three key factors to predict gaze: the eye's inherent motion tendency, visual saliency attraction, and audio semantic attraction. We also propose a probability density score to overcome the high individual variability of gaze trajectories, thereby improving the stability of optimization and the reliability of evaluation. Experimental results show that EyEar outperforms all baselines on all evaluation metrics, thanks to the proposed components of the learning model.

Paper Structure

This paper contains 19 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Unlike existing tasks, which predict gaze in situations such as (a) humans free-viewing an image, or (b) searching for required targets or clues to answer questions about an image, our proposed task (c) aims to predict human gaze in a more common scenario where humans receive synchronized audio signals while directing their gaze.
  • Figure 2: EyEar overview. The core component is a physics-informed, audio-aware dynamical system that simulates the motion of the eyes (see Module 1). The next predicted gaze point is calculated from the current gaze point, the time interval, and a motion vector. The motion vector is influenced by three kinds of forces. The most important force, the audio semantic attraction force, is predicted by Module 2. We propose a probability density loss (see Module 3) to train the model.
  • Figure 3: Illustration of our probability density score. (a) An example image with the gaze points of multiple subjects when they heard "the computer". (b) Its corresponding ground-truth distribution $\hat{P}_i$, visualized in 3D.
  • Figure 4: Visualization of the predicted gaze trajectories of different models and the ground-truth human gaze trajectories. Best viewed in color. We provide word-to-word translations for better understanding.
  • Figure 5: A radar chart showing the effect of DynS. Gaze trajectories are decomposed into saccade vectors pointing from the previous gaze point to the next. The angular coordinate is the angle between each vector and the horizontal direction. The radial coordinate is the length of the vector. The heat map shows the speed of each vector, calculated as length/duration.
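
The mechanisms named in the captions for Figures 2, 3, and 5 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the fixed `weights`, the linear form of the attraction terms, and the isotropic `sigma` are all assumptions made for illustration; in the paper the forces are learned and the exact form of the ground-truth distribution is not given here. Only the overall shape comes from the captions: the next gaze point depends on the current point, the time interval, and a motion vector; the density score is evaluated against a distribution built from multiple subjects' gaze points; and each saccade vector has an angle, a length, and a speed equal to length/duration.

```python
import math


def next_gaze(p, v_prev, saliency_pt, audio_pt, dt, weights=(0.5, 0.2, 0.3)):
    """Toy update: next point = current point + dt * motion vector.

    The motion vector mixes three terms: the previous motion (inertia), a pull
    toward a saliency point, and a pull toward an audio-semantic point.
    The fixed weights are hypothetical; the paper learns these forces.
    """
    w_inertia, w_sal, w_audio = weights
    motion = tuple(
        w_inertia * v + w_sal * (s - c) + w_audio * (a - c)
        for c, v, s, a in zip(p, v_prev, saliency_pt, audio_pt)
    )
    return tuple(c + dt * m for c, m in zip(p, motion)), motion


def density_score(pred, subject_pts, sigma=0.05):
    """Density at `pred` under an equal-weight isotropic 2D Gaussian mixture.

    One component is centred on each subject's gaze point. Equal weights and a
    shared sigma are assumptions of this sketch.
    """
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    total = 0.0
    for x, y in subject_pts:
        d2 = (pred[0] - x) ** 2 + (pred[1] - y) ** 2
        total += norm * math.exp(-d2 / (2.0 * sigma ** 2))
    return total / len(subject_pts)


def saccade_stats(points, durations):
    """Return (angle in degrees from horizontal, length, speed) per saccade.

    The sign of the angle depends on whether the y-axis points up or down.
    """
    stats = []
    for (x0, y0), (x1, y1), dur in zip(points, points[1:], durations):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        stats.append((math.degrees(math.atan2(dy, dx)), length, length / dur))
    return stats
```

For example, `saccade_stats([(0, 0), (3, 4)], [0.5])` yields one saccade of length 5.0 and speed 10.0, and `density_score` returns a higher value for a prediction near the subjects' gaze points than for one far from them.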