Seeing My Future: Predicting Situated Interaction Behavior in Virtual Reality

Yuan Xu, Zimu Zhang, Xiaoxuan Ma, Wentao Zhu, Yu Qiao, Yizhou Wang

TL;DR

This work tackles proactive prediction of situated user behavior in VR/AR by leveraging cognition-inspired reasoning. It introduces a hierarchical, intention-aware framework that first identifies likely interaction targets and then predicts detailed gaze, head/hand trajectories, and object interactions, guided by a context-aware dynamic Graph Convolutional Network. Key contributions include a frequency-domain observation encoder, a cognition-aligned hierarchical decoder, and a dynamic weight mechanism that adapts human–environment relationships in real time. Extensive evaluation on real VR data and the ADT dataset demonstrates improved accuracy and robustness, enabling proactive VR systems that anticipate user actions and adapt environments accordingly. The approach has practical implications for personalized VR experiences and intelligent assistive capabilities in immersive settings.
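As a rough illustration of what a frequency-domain observation encoding could look like, the sketch below transforms one coordinate of an observed trajectory with a discrete cosine transform (DCT) and keeps only the lowest-frequency coefficients. The summary does not say which transform the authors use; the DCT and the names `dct_ii`, `idct_ii`, and `encode_trajectory` are assumptions made for this example only.

```python
import math


def _scale(k, n):
    # Orthonormal scaling factor for coefficient k of a length-n DCT-II.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_ii(signal):
    """Orthonormal DCT-II of a 1-D sequence of floats."""
    n = len(signal)
    return [
        _scale(k, n)
        * sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def idct_ii(coeffs):
    """Inverse of dct_ii (an orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            _scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)
        )
        for i in range(n)
    ]


def encode_trajectory(signal, num_coeffs):
    """Hypothetical encoder step: keep the num_coeffs lowest-frequency terms."""
    return dct_ii(signal)[:num_coeffs]
```

Keeping only a few low-frequency coefficients yields a compact, smooth summary of the motion history; whether the paper truncates coefficients in this way is not stated here.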

Abstract

Virtual and augmented reality systems increasingly demand intelligent adaptation to user behaviors for enhanced interaction experiences. Achieving this requires accurately understanding human intentions and predicting future situated behaviors, such as gaze direction and object interactions, which is vital for creating responsive VR/AR environments and applications like personalized assistants. However, accurate behavioral prediction demands modeling the underlying cognitive processes that drive human-environment interactions. In this work, we introduce a hierarchical, intention-aware framework that models human intentions and predicts detailed situated behaviors by leveraging cognitive mechanisms. Given historical human dynamics and observations of the scene context, our framework first identifies potential interaction targets and then forecasts fine-grained future behaviors. We propose a dynamic Graph Convolutional Network (GCN) to effectively capture human-environment relationships. Extensive experiments on challenging real-world benchmarks and in a live VR environment demonstrate the effectiveness of our approach, which achieves superior performance across all metrics and enables practical applications for proactive VR systems that anticipate user behaviors and adapt virtual environments accordingly.
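The coarse-to-fine idea in the abstract (first shortlist likely interaction targets, then make the detailed prediction) can be sketched as a two-stage selection. This is a minimal illustration under stated assumptions, not the authors' decoder: the function names, the use of a softmax over the shortlist, and the example scores are all hypothetical.

```python
import math


def top_k_indices(coarse_scores, k):
    """Stage 1: indices of the k objects with the highest coarse scores."""
    order = sorted(range(len(coarse_scores)), key=lambda i: coarse_scores[i], reverse=True)
    return order[:k]


def refine_over_candidates(fine_logits, candidates):
    """Stage 2: softmax over fine-grained logits, restricted to the shortlist."""
    peak = max(fine_logits[i] for i in candidates)  # subtract max for stability
    exps = {i: math.exp(fine_logits[i] - peak) for i in candidates}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def predict_interaction(coarse_scores, fine_logits, k):
    """Return (shortlist, per-candidate probability, most likely object index)."""
    shortlist = top_k_indices(coarse_scores, k)
    probs = refine_over_candidates(fine_logits, shortlist)
    best = max(probs, key=probs.get)
    return shortlist, probs, best
```

Restricting the second stage to the shortlist means an object that was ruled out in the coarse stage cannot be chosen later, even if its fine-grained logit is high. That is the design trade-off of any hard top-K gate.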

Paper Structure

This paper contains 20 sections, 18 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Situated interaction behavior prediction. Given historical human dynamics, such as gaze direction and situational context, we propose an intention-aware framework to predict a person’s future behavior in the environment, including where to look (gaze), where to go (trajectory), and which objects to interact with (object interaction). Our approach employs a hierarchical prediction strategy aligned with human cognition, first coarsely identifying potential objects based on prior engagements before precisely forecasting the next specific interaction.
  • Figure 2: Overview of our framework, which comprises (1) an observation encoding module that captures and encodes historical human states and scene context, (2) a hierarchical intention-aware decoding module that first predicts the top-$K$ potential interaction targets and then forecasts the detailed next human states as well as the object interactions, and (3) a dynamic GCN that adaptively models relationships among human gaze, head positions, hand positions, and objects.
  • Figure 3: Experimental setup for the real-world VR evaluation. (a) Meta Quest Pro headset and Touch controllers used for interaction and data collection. (b) A top-down view of the interactive virtual environment. (c) The participant's egocentric view of the interactive VR scene. (d) Third-person view of a participant in the real world performing a natural interaction sequence with the virtual environment.
  • Figure 4: Comparison of our method with PickAndPlace [razali2022using] in a real-world VR environment. The columns in each subfigure represent consecutive timesteps. Input sequences (a) and ground truth (b) are shown across three forms: rendered visualization, exocentric view, and egocentric view. (c) presents visualization of predictions from PickAndPlace (top) and our method (bottom). Visual elements include gaze direction (rays), head pose, hand positions (Lego figures), and interaction targets (bounding boxes). Bold bounding boxes indicate the top-$K$ interaction candidates from our hierarchical framework, while the white-to-green gradient represents increasing predicted interaction probabilities.
  • Figure 5: Quantitative comparison of our method with PickAndPlace [razali2022using] and GazeMotion [hu24gazemotion] on real-world VR evaluation. We show errors on hand, head, and object center prediction as well as object interaction AP.
  • ...and 5 more figures
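
The dynamic GCN in the Figure 2 caption models relationships among gaze, head, hand, and object nodes adaptively. One common way to make a graph convolution "dynamic" is to recompute the adjacency matrix from the current node features at every step. The sketch below does this with a row-wise softmax over dot-product similarity. It is an assumption-laden illustration: the captions do not describe the paper's actual weighting mechanism, and all function names here are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def dynamic_adjacency(features):
    """Row-normalised edge weights from pairwise dot-product similarity."""
    return [
        softmax([sum(x * y for x, y in zip(fi, fj)) for fj in features])
        for fi in features
    ]


def dynamic_gcn_layer(features, weight):
    """One layer: ReLU(A(features) @ features @ weight), with A recomputed per call."""
    adjacency = dynamic_adjacency(features)
    mixed = matmul(adjacency, features)   # aggregate neighbour features
    projected = matmul(mixed, weight)     # shared linear projection
    return [[max(0.0, v) for v in row] for row in projected]
```

Each row of `features` would stand for one node (for example gaze, head, a hand, or an object), and `weight` is a projection matrix that would normally be learned. Because the adjacency depends on the features, the edge weights change as the person and the scene change.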