360+x: A Panoptic Multi-modal Scene Understanding Dataset
Hao Chen, Yuqi Hou, Chenyuan Qu, Irene Testini, Xiaohan Hong, Jianbo Jiao
TL;DR
We introduce 360+x, the first panoptic multi-modal dataset covering multiple viewpoints (360° panorama, third-person front view, and egocentric monocular/binocular) with aligned video, audio, directional binaural delay, GPS, and textual scene descriptions. The dataset supports five benchmark tasks (video scene classification, temporal action localization, cross-modality retrieval, self-supervised representation learning, and dataset adaptation), allowing systematic study of how viewpoint and modality affect panoptic scene understanding. Our experiments show that incorporating additional views and modalities consistently boosts performance, that self-supervised pre-training on 360+x can outperform purely supervised baselines, and that pre-training on 360+x improves transfer to external datasets such as THUMOS14. The work provides a rich, privacy-conscious resource together with a hierarchical multimodal fusion framework, supporting research across vision, audio, and spatial perception and toward robust, cross-domain scene understanding.
Abstract
Human perception of the world is shaped by a multitude of viewpoints and modalities. While many existing datasets focus on scene understanding from a single perspective (e.g. egocentric or third-person views), our dataset offers a panoptic perspective (i.e. multiple viewpoints with multiple data modalities). Specifically, we encapsulate third-person panoramic and front views, as well as egocentric monocular/binocular views, with rich modalities including video, multi-channel audio, directional binaural delay, location data and textual scene descriptions for each captured scene, providing a comprehensive observation of the world. Figure 1 offers a glimpse of all 28 scene categories of our 360+x dataset. To the best of our knowledge, this is the first database that covers multiple viewpoints with multiple data modalities to mimic how daily information is accessed in the real world. Through our benchmark analysis, we present 5 different scene understanding tasks on the proposed 360+x dataset to evaluate the impact and benefit of each data modality and perspective in panoptic scene understanding. We hope this unique dataset will broaden the scope of comprehensive scene understanding and encourage the community to approach these problems from more diverse perspectives.
