360 in the Wild: Dataset for Depth Prediction and View Synthesis

Kibaek Park, Francois Rameau, Jaesik Park, In So Kweon

TL;DR

This work tackles the paucity of real-world 360° datasets with ground-truth pose and depth by introducing 360° in the Wild, a large-scale collection of 25K real omnidirectional images sourced from internet videos with pose and depth annotations. It benchmarks depth estimation and novel view synthesis on this dataset, adapting MiDaS for omnidirectional depth and extending NeRF++ to spherical panoramas for 360° view synthesis. The dataset spans Indoor, Outdoor, and Mannequin scenes and includes moving-object masks, enabling robust learning in diverse real-world conditions. Although ground-truth depth is not metric-scaled due to SfM/MVS limitations, the release provides video links, per-frame annotations, and sequence segmentation to support broad research on omnidirectional perception and rendering in the wild.
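
To make the panoramic adaptation concrete: extending NeRF++ to equirectangular 360° input mainly means casting one ray per pixel over the full sphere instead of through a pinhole intrinsic matrix. Below is a minimal sketch of that pixel-to-ray mapping, assuming equirectangular images and a y-up convention; the function name is hypothetical and this is not the authors' implementation.

```python
import numpy as np

def equirect_rays(h, w):
    """Map each pixel of an h x w equirectangular panorama to a unit ray
    direction on the sphere (hypothetical helper, not the paper's code)."""
    # pixel-center coordinates normalized to [0, 1)
    u = (np.arange(w) + 0.5) / w          # across columns
    v = (np.arange(h) + 0.5) / h          # down rows
    lon = (u - 0.5) * 2.0 * np.pi         # longitude in [-pi, pi)
    lat = (0.5 - v) * np.pi               # latitude in [pi/2, -pi/2]
    lon, lat = np.meshgrid(lon, lat)      # both (h, w)

    # spherical -> Cartesian unit directions (y up)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)   # (h, w, 3), unit norm

# All rays of one panorama share the camera center as their origin; world-space
# directions are obtained by rotating these vectors with the camera pose.
```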

Abstract

The large abundance of perspective camera datasets facilitated the emergence of novel learning-based strategies for various tasks, such as camera localization, single image depth estimation, or view synthesis. However, panoramic or omnidirectional image datasets, including essential information such as pose and depth, are mostly made with synthetic scenes. In this work, we introduce a large-scale 360$^{\circ}$ video dataset in the wild. This dataset has been carefully scraped from the Internet and has been captured from various locations worldwide. Hence, this dataset exhibits very diversified environments (e.g., indoor and outdoor) and contexts (e.g., with and without moving objects). Each of the 25K images constituting our dataset is provided with its respective camera pose and depth map. We illustrate the relevance of our dataset for two main tasks, namely, single image depth estimation and view synthesis.

Paper Structure

This paper contains 15 sections, 1 equation, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Example data from 360$^\circ$ in the Wild. The proposed dataset consists of omnidirectional videos, camera trajectories, and successive scene depth. The dataset enables depth estimation and novel view synthesis with real-world videos. The figure shows, from the top row to the bottom, omnidirectional RGB, scene depth, depth prediction by [Hu2019RevisitingSI], and novel view synthesis rendered as perspective views using the extended NeRF++ [nerfpp]. Images are cropped for the best view.
  • Figure 2: Statistics of depth values in our dataset. Each category has a different distribution because of the scene properties. The number of pixels is displayed in log scale for better visualization.
  • Figure 3: Sample omnidirectional images from 360$^\circ$ in the Wild. Each row is grouped by category: Indoor, Outdoor, and Mannequin challenge sequences, respectively. The person capturing the video is manually masked out, since the person (with a gimbal or camera stick) consistently appears in the video and is not relevant to the scene context. Images are cropped for the best view. Additional samples are available in the supplementary material.
  • Figure 4: Qualitative comparison of 360$^\circ$ novel view synthesis. (a) Generated image with the official implementation of NeRF++ [nerfpp]. Note that NeRF++ is not designed for 360$^\circ$ images and is included only to validate the necessity of our adaptation. (b) Results of the extended NeRF++ for 360$^\circ$ images. (c) Ground-truth RGB images. Images are cropped for the best view. The images are taken from the Mannequin-1 sequence (first row), the Outdoor-1 sequence (second row), and the Indoor-1 sequence (third row) in Table \ref{tbl:nerfpp_and_ours}. Additional examples are available in the supplement.
  • Figure 5: 360$^\circ$ view synthesis with masked videos. The extended NeRF++ trained on our dataset can generate images of the masked region, even though the network is trained with only a masked input image sequence. For better comparison, original images with a blurred human face are shown on the left. The images are taken from the Indoor-m1 sequence (first row) and the Indoor-m2 sequence (second row) in Table \ref{tbl:nerfpp_and_ours}. Please see the supplement for the video. (A hedged masked-loss sketch follows this figure list.)
  • ...and 2 more figures
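
As Figures 3 and 5 describe, the camera operator and moving objects are masked out and the networks are supervised only on unmasked pixels. A minimal sketch of such a masked photometric loss follows, assuming PyTorch, per-ray boolean masks, and a hypothetical function name; it is not the authors' implementation.

```python
import torch

def masked_photometric_loss(pred_rgb, gt_rgb, mask):
    """Hypothetical masked MSE over rendered rays (not the authors' code).

    pred_rgb, gt_rgb: (N, 3) rendered and reference colors for N sampled rays.
    mask: (N,) bool, True where the pixel shows static scene content; rays that
    fall on the masked camera operator or moving objects are ignored, so the
    radiance field is supervised only by the static background.
    """
    valid = mask.float()
    sq_err = ((pred_rgb - gt_rgb) ** 2).sum(dim=-1)
    return (sq_err * valid).sum() / valid.sum().clamp(min=1.0)
```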