360+x: A Panoptic Multi-modal Scene Understanding Dataset

Hao Chen, Yuqi Hou, Chenyuan Qu, Irene Testini, Xiaohan Hong, Jianbo Jiao

TL;DR

We introduce 360+x, the first panoptic multi-modal dataset covering multiple viewpoints (360° panorama, third-person front view, and egocentric monocular/binocular) with aligned video, audio, directional binaural delay, GPS, and textual scene descriptions. The dataset supports five benchmark tasks (video scene classification, temporal action localization, cross-modality retrieval, self-supervised representation learning, and dataset adaptation), allowing systematic study of how viewpoint and modality affect panoptic scene understanding. Through extensive experiments, we show that incorporating additional views and modalities consistently boosts performance, that self-supervised pre-training on 360+x can outperform purely supervised baselines, and that pre-training on 360+x improves transfer to external datasets such as THUMOS14. The work provides a rich, privacy-conscious resource together with a hierarchical multimodal fusion framework, supporting research on vision, audio, and spatial perception and on robust, cross-domain scene understanding.

Abstract

Human perception of the world is shaped by a multitude of viewpoints and modalities. While many existing datasets focus on scene understanding from a single perspective (e.g. egocentric or third-person views), our dataset offers a panoptic perspective (i.e. multiple viewpoints with multiple data modalities). Specifically, we encapsulate third-person panoramic and front views, as well as egocentric monocular/binocular views, with rich modalities including video, multi-channel audio, directional binaural delay, location data and textual scene descriptions for each scene captured, presenting a comprehensive observation of the world. Figure 1 offers a glimpse of all 28 scene categories of our 360+x dataset. To the best of our knowledge, this is the first database that covers multiple viewpoints with multiple data modalities to mimic how daily information is accessed in the real world. Through our benchmark analysis, we present five different scene understanding tasks on the proposed 360+x dataset to evaluate the impact and benefit of each data modality and perspective in panoptic scene understanding. We hope this unique dataset will broaden the scope of comprehensive scene understanding and encourage the community to approach these problems from more diverse perspectives.

Paper Structure (57 sections, 34 figures, 12 tables)

Figures (34)

  • Figure 1: Example 360$^{\circ}$ panoramic videos from all 28 scene categories.
  • Figure 2: Illustration of the proposed 360+x dataset. The 360$^{\circ}$ camera records raw fish-eye videos with its front and back lenses. These videos are merged to create a spherical 360$^{\circ}$ panorama (top-middle figure, zoom in for details), which is then transformed into (a) 360$^{\circ}$ panoramic data using equirectangular projection. The (b) third-person front view is obtained by de-warping the movement-rich region, highlighted in red, of the spherical 360$^{\circ}$ panorama (middle-left figure). Wearing stereo cameras, the capturers record (c) egocentric clips while staying visible to the fixed 360$^{\circ}$ camera (central ellipse). (e) Directional audio time delay data is generated from (d) the left and right audio inputs of the 360$^{\circ}$ camera via an interaural time delay process [chen2022sound]. This helps locate sound sources in the 360$^{\circ}$ panorama.
  • Figure 3: Dataset statistics: distributions of (a) scene category, (b) actions per city, (c) temporal action instance duration, (d) number of actions per video, (e) capturing time, and (f) binaural delay per clip.
  • Figure 4: Additional dataset indoor/outdoor statistics.
  • Figure 5: Illustration of the self-supervised learning (SSL) techniques used in our study: within SSL, audio is processed in tandem with the video frames. For example, when the video is sped up by a factor of 2, the audio is subsampled by a factor of 2 (thus speeding it up as well) to maintain synchronisation. Likewise, if the sequence of video clips is rearranged, the audio clips are reshuffled in the same order. ITD data is processed in the same way as the audio data.
  • ...and 29 more figures
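
The synchronised augmentation described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `speed_up` and `shuffle_clips`, the `samples_per_frame` parameter, and the use of plain Python lists are all assumptions made here, and the audio is subsampled naively without any anti-aliasing filter. According to the caption, ITD data would be handled the same way as the audio track.

```python
import random
from typing import List, Optional, Sequence, Tuple


def speed_up(frames: Sequence, audio: Sequence, factor: int = 2) -> Tuple[List, List]:
    """Speed up video and audio together by keeping every `factor`-th item.

    Hypothetical helper: naive subsampling, no anti-aliasing filter.
    """
    return list(frames[::factor]), list(audio[::factor])


def shuffle_clips(
    frames: Sequence,
    audio: Sequence,
    samples_per_frame: int,
    clip_len: int,
    seed: Optional[int] = None,
) -> Tuple[List, List, List[int]]:
    """Cut both streams into clips of `clip_len` frames and apply one shared permutation.

    Trailing frames that do not fill a whole clip are dropped.
    """
    n_clips = len(frames) // clip_len
    order = list(range(n_clips))
    random.Random(seed).shuffle(order)
    step = clip_len * samples_per_frame  # audio samples covered by one clip
    new_frames = [f for i in order for f in frames[i * clip_len:(i + 1) * clip_len]]
    new_audio = [s for i in order for s in audio[i * step:(i + 1) * step]]
    return new_frames, new_audio, order


# Toy data: 8 frames, 4 audio samples per frame; each sample is tagged
# with the index of the frame it belongs to so alignment is easy to check.
SPF = 4
frames = list(range(8))
audio = [f for f in frames for _ in range(SPF)]

fast_frames, fast_audio = speed_up(frames, audio, factor=2)
shuf_frames, shuf_audio, order = shuffle_clips(frames, audio, SPF, clip_len=2, seed=0)
```

In this toy setup the sped-up streams keep the same number of audio samples per frame, and after shuffling each frame is still followed by its own audio samples, which is the synchronisation property the caption refers to.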