Towards Cybersickness Severity Classification from VR Gameplay Videos Using Transfer Learning and Temporal Modeling

Jyotirmay Nag Setu, Kevin Desai, John Quarles

TL;DR

The paper addresses the problem of predicting cybersickness severity during VR gameplay without external sensors. It introduces a sensor-free pipeline that combines frame features from an ImageNet-pretrained InceptionV3 with LSTM-based temporal modeling. Frames are sampled at 1 FPS and compressed via temporal max pooling to form the input sequences for the LSTM, which reaches $68.44\%$ accuracy across four severity levels on the VRWalking dataset. The result indicates that high-level semantic visual features, when modeled over time, can predict cybersickness in interactive VR and offer a scalable alternative to physiological sensing. The work lays groundwork for video-based temporal models in VR comfort research and suggests directions toward multimodal and attention-based extensions.
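The two preprocessing steps named above, 1 FPS frame sampling and temporal max pooling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the non-overlapping pooling windows, and the window size are assumptions made for illustration, and the toy two-dimensional vectors stand in for real InceptionV3 frame features.

```python
from typing import List, Sequence


def sample_frame_indices(total_frames: int, native_fps: float,
                         target_fps: float = 1.0) -> List[int]:
    """Pick frame indices so the kept frames are spaced at roughly target_fps.

    Hypothetical helper: the source only states that frames are sampled at 1 FPS.
    """
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Number of native frames between two kept frames (never less than one).
    step = max(native_fps / target_fps, 1.0)
    indices = []
    k = 0
    while int(k * step) < total_frames:
        indices.append(int(k * step))
        k += 1
    return indices


def temporal_max_pool(features: Sequence[Sequence[float]],
                      window: int) -> List[List[float]]:
    """Element-wise max over non-overlapping windows of consecutive feature vectors.

    Assumption: windows do not overlap and a shorter trailing window is kept.
    The source does not specify the window size or stride.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pooled = []
    for start in range(0, len(features), window):
        chunk = features[start:start + window]
        # zip(*chunk) groups values per feature dimension across the window.
        pooled.append([max(column) for column in zip(*chunk)])
    return pooled


# Usage: a 3-second clip at 30 FPS keeps one frame per second.
kept = sample_frame_indices(total_frames=90, native_fps=30.0)

# Toy per-frame features (real InceptionV3 features would be much longer vectors).
frame_features = [[1.0, 5.0], [3.0, 2.0], [0.0, 9.0], [4.0, 4.0]]
sequence_for_lstm = temporal_max_pool(frame_features, window=2)
```

The pooled list is shorter than the input by roughly a factor of `window`, which is the sense in which pooling compresses the sequence before it reaches the LSTM.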

Abstract

With the rapid advancement of virtual reality (VR) technology, its adoption across domains such as healthcare, education, and entertainment has grown significantly. However, the persistent issue of cybersickness, marked by symptoms resembling motion sickness, continues to hinder widespread acceptance of VR. While recent research has explored multimodal deep learning approaches leveraging data from integrated VR sensors like eye and head tracking, there remains limited investigation into the use of video-based features for predicting cybersickness. In this study, we address this gap by utilizing transfer learning to extract high-level visual features from VR gameplay videos using the InceptionV3 model pretrained on the ImageNet dataset. These features are then passed to a Long Short-Term Memory (LSTM) network to capture the temporal dynamics of the VR experience and predict cybersickness severity over time. Our approach effectively leverages the time-series nature of video data, achieving a 68.4% classification accuracy for cybersickness severity. This surpasses the performance of existing models trained solely on video data, providing a practical tool for VR developers to evaluate and mitigate cybersickness in virtual environments. Furthermore, this work lays the foundation for future research on video-based temporal modeling for enhancing user comfort in VR applications.

Paper Structure

This paper contains 13 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: Training and Validation Loss over Folds
  • Figure 2: Temporal Importance - Standard Gradients
  • Figure 3: Temporal Importance - Integrated Gradients