DiVR: incorporating context from diverse VR scenes for human trajectory prediction

Franz Franco Gallo, Hui-Yin Wu, Lucile Sassatelli

TL;DR

This work proposes Diverse Context VR Human Motion Prediction (DiVR), a cross-modal transformer based on the Perceiver architecture that integrates both static and dynamic scene context using a heterogeneous graph convolution network.

Abstract

Virtual environments provide a rich and controlled setting for collecting detailed data on human behavior, offering unique opportunities for predicting human trajectories in dynamic scenes. However, most existing approaches have overlooked the potential of these environments, focusing instead on static contexts without considering user-specific factors. Employing the CREATTIVE3D dataset, our work models trajectories recorded in virtual reality (VR) scenes for diverse situations including road-crossing tasks with user interactions and simulated visual impairments. We propose Diverse Context VR Human Motion Prediction (DiVR), a cross-modal transformer based on the Perceiver architecture that integrates both static and dynamic scene context using a heterogeneous graph convolution network. We conduct extensive experiments comparing DiVR against existing architectures including MLP, LSTM, and transformers with gaze and point cloud context. We also stress test our model's generalizability across different users, tasks, and scenes. Results show that DiVR achieves higher accuracy and adaptability compared to other models and to static graphs. This work highlights the advantages of using VR datasets for context-aware human trajectory modeling, with potential applications in enhancing user experiences in the metaverse. Our source code is publicly available at https://gitlab.inria.fr/ffrancog/creattive3d-divr-model.

Paper Structure

This paper contains 28 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Left: A heterogeneous graph representing a pedestrian crossing scene, with nodes for different locations (H: Home, S: Sidewalks, R: Road) and objects (U: User, V: Vehicle, T: Traffic Lights, B: Button). Right: Three scenarios of road crossing are depicted. In each scenario, the red line represents the input trajectory ending at the green dot. The blue line indicates the ground truth future motion, while the magenta line shows the predicted future motion using DiVR.
  • Figure 2: Overview of the DiVR model
  • Figure 3: Context evaluation: pedestrian trajectory prediction comparison between two models, GIMO (scene point cloud) and DiVR-Het (heterogeneous graphs), in scenes sampled from the CREATTIVE3D dataset. In each scene, red lines and green dots show the observed past trajectories and their endpoints, blue lines the ground truth future trajectories, and magenta lines the predicted trajectories.
  • Figure 4: User Generalization in CREATTIVE3D: Each subplot shows a different situation (e.g., leaving home, pressing the button) for users not seen during training of the DiVR-Het model. Magenta lines show the predicted trajectories, blue lines the ground truth, and red lines the observed trajectories, with green dots marking their endpoints.
  • Figure 5: Scene Generalization: pedestrian trajectory predictions by DiVR trained on single-lane crossings and tested on two-lane crossings, highlighting its capacity and limitations in adapting to complex urban layouts. Magenta lines show the predicted trajectories, blue lines the ground truth, and red lines the observed trajectories, with green dots marking their endpoints.