Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention

Suleyman Ozdel, Yao Rong, Berat Mert Albaba, Yen-Ling Kuo, Xi Wang, Enkelejda Kasneci

TL;DR

This work addresses intention-conditioned action anticipation: given a partial video, infer the agent's intention and predict the atomic actions that remain. Gaze-guided video segments are converted into a visual-semantic graph, and a Graph Neural Network operates on that graph. The framework combines gaze-centered patch representations with object-aware edge attributes, processes them with edge-conditioned convolutions, and uses a hierarchical classifier that jointly learns activity recognition and action sequence prediction. The authors introduce a VirtualHome-based eye-tracking dataset (185 videos, 18 activities, 178 atomic actions) and report that incorporating human gaze improves both intention recognition and action anticipation over state-of-the-art baselines, including a 7% accuracy gain on 18-class intention recognition. The results suggest that human attention cues are valuable for video understanding, with potential applications in assistive robotics and smart-home systems that anticipate user goals from partial observations.
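To make the edge-conditioned convolution step more concrete, the following is a minimal pure-Python sketch of one such layer in its generic form: each edge attribute is mapped to a weight matrix, that matrix transforms the source node's features, and the resulting messages are averaged at the destination node. This is not the authors' implementation. The node names (`patch_a`, `patch_b`, `patch_c`), the single linear filter-generating map `theta`, the ReLU activation, and all dimensions are assumptions chosen for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def edge_weight_matrix(edge_attr, theta, d_out, d_in):
    """Hypothetical filter-generating network: one linear map that turns
    an edge attribute vector into a d_out x d_in weight matrix."""
    flat = matvec(theta, edge_attr)  # length d_out * d_in
    return [flat[r * d_in:(r + 1) * d_in] for r in range(d_out)]


def ecc_layer(node_feats, edges, theta, bias, d_out):
    """One edge-conditioned convolution step.

    node_feats: dict mapping node name -> feature list of length d_in
    edges:      list of (src, dst, edge_attr) tuples
    Returns a dict mapping node name -> updated feature list of length d_out.
    """
    d_in = len(next(iter(node_feats.values())))
    sums = {n: [0.0] * d_out for n in node_feats}
    counts = {n: 0 for n in node_feats}
    for src, dst, attr in edges:
        weights = edge_weight_matrix(attr, theta, d_out, d_in)
        message = matvec(weights, node_feats[src])
        sums[dst] = [s + m for s, m in zip(sums[dst], message)]
        counts[dst] += 1
    out = {}
    for n in node_feats:
        k = max(counts[n], 1)  # avoid division by zero for isolated nodes
        out[n] = [max(0.0, s / k + b) for s, b in zip(sums[n], bias)]  # mean + bias, then ReLU
    return out


# Toy graph (illustrative values only): three patch nodes with 2-d features.
node_feats = {"patch_a": [1.0, 2.0], "patch_b": [3.0, 4.0], "patch_c": [5.0, 6.0]}
# Each edge carries a 1-d attribute; patch_c has no incoming edges.
edges = [("patch_a", "patch_b", [2.0]), ("patch_b", "patch_a", [0.5])]
# theta maps a scalar attribute a to the matrix a * I (rows flattened).
theta = [[1.0], [0.0], [0.0], [1.0]]
bias = [0.0, 0.0]

out = ecc_layer(node_feats, edges, theta, bias, d_out=2)
print(out)
```

With this choice of `theta`, each message is simply the source features scaled by the edge attribute, which makes the layer easy to check by hand. In a trained model the filter-generating map would be learned, and the edge attributes would encode the object-aware information described above.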

Abstract

Humans use their gaze to concentrate on essential information while perceiving and interpreting intentions in videos. Incorporating human gaze into computational algorithms can significantly enhance model performance in video understanding tasks. In this work, we address a challenging and innovative task in video understanding: predicting the actions of an agent in a video based on a partial video. We introduce the Gaze-guided Action Anticipation algorithm, which establishes a visual-semantic graph from the video input. Our method uses a Graph Neural Network to recognize the agent's intention and predict the action sequence that fulfills this intention. To assess the effectiveness of our approach, we collect a dataset of household activities generated in the VirtualHome environment, accompanied by gaze data from human viewers of the videos. Our method outperforms state-of-the-art techniques, achieving a 7% improvement in accuracy for 18-class intention recognition. This highlights the effectiveness of our method in learning important features from human gaze data.

Paper Structure

This paper contains 24 sections, 3 equations, 7 figures, 7 tables, and 1 algorithm.

Figures (7)

  • Figure 1: The household activity (intention) of the human agent in the video is to put the cutlery (forks) into the cupboard. In the first watch step, the AI agent observes the human activity. In the second predict step, the AI agent aims to understand the human intention and predict the remaining actions in order to complete the task.
  • Figure 2: Workflow of our proposed framework, Gaze-guided Action Anticipation. Our model first predicts the gaze fixation (Left) and establishes a visual-semantic graph from the input video (Middle). Based on the graph, it solves the downstream tasks of activity recognition and action prediction (Right).
  • Figure 3: Two examples of predicted actions using our model. The predicted action sequence is shown below the graph. Graphs are visualized based on the viewed video, where some nodes are omitted with "$\dots$" for a clearer view. The intention is given at the bottom.
  • Figure 4: Dataset statistics: (a) Distribution of interacting objects for each activity. (b) Distribution of atomic actions for each activity. (c) Distribution of rooms for each activity.
  • Figure 5: Histogram of cosine similarity values between two visual embeddings.
  • ...and 2 more figures