Pose2Gaze: Eye-body Coordination during Daily Activities for Gaze Prediction from Full-body Poses

Zhiming Hu, Jiahui Xu, Syn Schmitt, Andreas Bulling

TL;DR

This work investigates gaze prediction by analyzing eye-body coordination across real-world, VR, and AR datasets. The analysis shows that eye gaze aligns strongly with head direction and full-body poses, with distinct temporal relationships in human-object versus human-human interactions. It then introduces Pose2Gaze, a model that fuses body-orientation features extracted from head directions by a CNN with spatio-temporal body-motion features extracted from full-body poses by a GCN, and predicts gaze directions from the fused features. Pose2Gaze outperforms head-only baselines on four datasets for past, present, and future input poses, with average reductions in mean angular error ranging from 10.1% on ADT to 28.6% on EgoBody. The experiments also demonstrate the method’s value for the downstream task of eye-based activity recognition, with practical implications for gaze-contingent rendering and VR/AR interaction. The results point to a new direction in pose-based gaze prediction that uses full-body information to capture eye-body coordination in daily activities.

Abstract

Human eye gaze plays a significant role in many virtual and augmented reality (VR/AR) applications, such as gaze-contingent rendering, gaze-based interaction, or eye-based activity recognition. However, prior works on gaze analysis and prediction have only explored eye-head coordination and were limited to human-object interactions. We first report a comprehensive analysis of eye-body coordination in various human-object and human-human interaction activities based on four public datasets collected in real-world (MoGaze), VR (ADT), as well as AR (GIMO and EgoBody) environments. We show that in human-object interactions, e.g. pick and place, eye gaze exhibits strong correlations with full-body motion while in human-human interactions, e.g. chat and teach, a person's gaze direction is correlated with the body orientation towards the interaction partner. Informed by these analyses we then present Pose2Gaze, a novel eye-body coordination model that uses a convolutional neural network and a spatio-temporal graph convolutional neural network to extract features from head direction and full-body poses, respectively, and then uses a convolutional neural network to predict eye gaze. We compare our method with state-of-the-art methods that predict eye gaze only from head movements and show that Pose2Gaze outperforms these baselines with an average improvement of 24.0% on MoGaze, 10.1% on ADT, 21.3% on GIMO, and 28.6% on EgoBody in mean angular error, respectively. We also show that our method significantly outperforms prior methods in the sample downstream task of eye-based activity recognition. These results underline the significant information content available in eye-body coordination during daily activities and open up a new direction for gaze prediction.
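
The comparisons above are reported in mean angular error. As a minimal sketch, assuming that gaze directions are given as 3D vectors and that the improvement is computed relative to the baseline error, the metric could look as follows. The function names are hypothetical and this is not the authors' evaluation code.

```python
import math

def angular_error_deg(pred, truth):
    """Angle in degrees between two 3D direction vectors (need not be unit length)."""
    dot = sum(p * t for p, t in zip(pred, truth))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in truth))
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))

def mean_angular_error(preds, truths):
    """Average angular error over paired predicted and ground-truth directions."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, truths)]
    return sum(errors) / len(errors)

def relative_improvement(baseline_error, method_error):
    """Percentage reduction in error relative to the baseline (assumed definition)."""
    return 100.0 * (baseline_error - method_error) / baseline_error
```

Under this assumed definition, an improvement of 24.0% means that the method's mean angular error is 76.0% of the baseline's.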

Paper Structure

This paper contains 51 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The cosine similarities between head and gaze directions at different time intervals in the (a) MoGaze, (b) ADT, (c) GIMO, and (d) EgoBody datasets. The highest correlations occur between $-100$ and $200$ ms, suggesting that there is little or no time delay between head and eye movements.
  • Figure 2: The cosine similarities between eye gaze and body motions at different time intervals in the (a) MoGaze, (b) ADT, (c) GIMO, and (d) EgoBody datasets. The highest correlations occur between $400$ and $1500$ ms, indicating that eye movements precede body motions.
  • Figure 3: (a) Eye gaze and the directions pointing from a person's body to the body of the interaction partner, and (b) the cosine similarities between eye gaze and the directions between two bodies at different time intervals. The highest correlations either occur at $0$ ms or are very close to the correlation values at $0$ ms, suggesting that there is little or no time delay between body motions and eye gaze.
  • Figure 4: Architecture of the proposed Pose2Gaze model. Pose2Gaze first uses a 1D convolutional neural network to extract body orientation features from head directions, then applies a spatio-temporal graph convolutional neural network to extract the body motion features from human full-body poses, and finally employs a 1D convolutional neural network to generate human eye gaze from the extracted body orientation and motion features.
  • Figure 5: Results of different methods for generating eye gaze from present poses on the MoGaze and GIMO datasets. The green line indicates the ground truth while the blue line represents the predicted eye gaze.
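
The time-interval analyses in Figures 1 to 3 compare two direction signals at different temporal offsets. A minimal sketch of such a lagged cosine-similarity analysis is given below. It assumes that both signals are equally sampled sequences of 3D direction vectors and that a positive lag means the second signal follows eye gaze. The function names and the sign convention are assumptions for illustration, not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def lagged_similarity(gaze, other, lag):
    """Mean cosine similarity between gaze[t] and other[t + lag] over all valid t."""
    n = min(len(gaze), len(other))
    start = max(0, -lag)      # keep t + lag >= 0
    stop = min(n, n - lag)    # keep t + lag < n
    if start >= stop:
        raise ValueError("lag is too large for the sequence length")
    total = sum(cosine_similarity(gaze[t], other[t + lag]) for t in range(start, stop))
    return total / (stop - start)

def best_lag_ms(gaze, other, max_lag, frame_ms):
    """Offset in milliseconds, within +/- max_lag frames, with the highest mean similarity."""
    lags = range(-max_lag, max_lag + 1)
    best = max(lags, key=lambda k: lagged_similarity(gaze, other, k))
    return best * frame_ms
```

With this convention, a peak at a positive offset corresponds to the second signal lagging behind eye gaze, as the Figure 2 caption describes for body motions.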