
Attention Guidance through Video Script: A Case Study of Object Focusing on 360° VR Video Tours

Paulo Vitor Santana Silva, Arthur Ricardo Sousa Vitória, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho

Abstract

Within the expansive domain of virtual reality (VR), 360° VR videos immerse viewers in a spherical environment, allowing them to explore and interact with the virtual world from all angles. While this video representation offers unparalleled levels of immersion, it often lacks effective methods to guide viewers' attention toward specific elements within the virtual environment. This paper combines the Grounding DINO and Segment Anything (SAM) models to guide attention by focusing on the objects named in a video script. As a case study, the experiments are conducted on a 360° video tour of the University of Reading. The experimental results show that video scripts can improve the user experience in 360° VR video tours by helping to direct the user's attention.

Paper Structure (6 sections, 6 figures)

This paper contains 6 sections, 6 figures.

Figures (6)

  • Figure 1: Different moments of the scene showing the museum on the video tour. It is defined in the video script as “Look at the sculpture of a person on the right side” at moment (1) and “Look at the sculpture of a centaur on the left side” at moment (2). (a) The original frame of the video. (b) The object described in the script detected and segmented. (c) The target object with the vignette effect applied.
  • Figure 2: Different frames from the 360° video showcasing distinct environments: (a) an external area, (b) a biology laboratory, (c) an external building, and (d) the gym.
  • Figure 3: General workflow for a single frame in a 360° VR tour. (a) A video description and a selected 360° input frame $t$ are given as input to Grounding DINO. (b) Grounding DINO selects the area (bounding box) with the highest confidence. (c) The output bounding box and the image are then used as input to SAM for object segmentation, which outputs a segmentation mask. (d) The segmentation mask and bounding box are used to create a vignette effect that indicates where the user should pay attention.
  • Figure 4: Different moments of the scene showing the cafe lounge on the video tour. The video script specifies “Look at the cafe lounge” at moment (1) and “Look at the cars between the trees” at moment (2). (a) The original frame of the video. (b) The object described in the script, detected and segmented. (c) The target object with the vignette effect applied.
  • Figure 5: Grounding DINO architecture (liu2023grounding).
  • ...and 1 more figures
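
The last two stages of the workflow in Figure 3, choosing the most confident bounding box and applying a vignette around the segmented object, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `select_best_box` and `apply_vignette` and the parameters `dim` and `falloff` are hypothetical, the detector and segmenter outputs are assumed to be already available as plain arrays, and the horizontal wrap-around of a 360° equirectangular frame is ignored for simplicity.

```python
import numpy as np


def select_best_box(boxes, scores):
    """Return the candidate box with the highest confidence (Figure 3, step b).

    boxes: N x 4 array-like of (x_min, y_min, x_max, y_max) in pixels.
    scores: N confidence values, one per box.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return boxes[int(np.argmax(scores))]


def apply_vignette(frame, mask, box, dim=0.3, falloff=0.5):
    """Darken the frame away from the target object (Figure 3, step d).

    frame: H x W x 3 uint8 image.
    mask: H x W boolean segmentation mask of the target object.
    box: (x_min, y_min, x_max, y_max) of the target object.
    dim: brightness multiplier far from the object (hypothetical parameter).
    falloff: width of the bright-to-dark transition, as a fraction of the
        box half-diagonal (hypothetical parameter).
    """
    h, w = mask.shape
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    radius = 0.5 * np.hypot(x1 - x0, y1 - y0)  # half-diagonal of the box

    # Distance of every pixel from the box centre.
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(xs - cx, ys - cy)

    # t is 0 inside the radius and rises linearly to 1 across the falloff band.
    t = np.clip((dist - radius) / max(falloff * radius, 1e-6), 0.0, 1.0)
    weight = 1.0 - (1.0 - dim) * t
    weight[mask] = 1.0  # pixels on the object keep full brightness

    out = frame.astype(np.float32) * weight[..., None]
    return out.astype(np.uint8)


# Toy example: a uniform grey frame with a rectangular "object".
frame = np.full((40, 80, 3), 200, dtype=np.uint8)
mask = np.zeros((40, 80), dtype=bool)
mask[15:25, 30:50] = True
box = select_best_box([[0, 0, 10, 10], [30, 15, 50, 25]], [0.2, 0.9])
out = apply_vignette(frame, mask, box)
```

In a full pipeline, `boxes` and `scores` would come from the text-conditioned detector and `mask` from the segmenter. The radial falloff centred on the box is one simple design choice among several possible ways of rendering the vignette.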