SANPO: A Scene Understanding, Accessibility and Human Navigation Dataset

Sagar M. Waghmare, Kimberly Wilber, Dave Hawkey, Xuan Yang, Matthew Wilson, Stephanie Debats, Cattalyya Nuengsigkapian, Astuti Sharma, Lars Pandikow, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko

TL;DR

SANPO fills a critical gap in outdoor, egocentric scene understanding by providing a large-scale real and synthetic dataset tailored for human navigation. It delivers dense panoptic segmentation and depth for real data, plus pixel-perfect synthetic annotations, enabling cross-domain evaluation and synthetic-to-real adaptation. The work demonstrates significant domain gaps for current segmentation and depth models on SANPO, establishes comprehensive benchmarks, and shows practical applicability through on-device obstacle detection and applications such as Project Guideline. By releasing both data and mobile-ready baselines under CC BY 4.0, SANPO accelerates development of assistive navigation technologies for visually impaired users and advances research in domain adaptation and autonomous navigation from a human-centric perspective.

Abstract

Vision is essential for human navigation. The World Health Organization (WHO) estimates that 43.3 million people were blind in 2020, and this number is projected to reach 61 million by 2050. Modern scene understanding models could empower these people by assisting them with navigation, obstacle avoidance and visual recognition capabilities. The research community needs high-quality datasets for both training and evaluation to build these systems. While datasets for autonomous vehicles are abundant, there is a critical gap in datasets tailored for outdoor human navigation. This gap poses a major obstacle to the development of computer-vision-based assistive technologies. To overcome this obstacle, we present SANPO, a large-scale egocentric video dataset designed for dense prediction in outdoor human navigation environments. SANPO contains 701 stereo videos of 30+ seconds captured in diverse real-world outdoor environments across four geographic locations in the USA. Every frame has a high-resolution depth map, and 112K frames were annotated with temporally consistent dense video panoptic segmentation labels. The dataset also includes 1961 high-quality synthetic videos with pixel-accurate depth and panoptic segmentation annotations to balance the noisy real-world annotations with the high-precision synthetic annotations. SANPO is already publicly available and is being used by mobile applications like Project Guideline to train mobile models that help low-vision users go running outdoors independently. SANPO is available here: https://google-research-datasets.github.io/sanpo_dataset/

Paper Structure

This paper contains 57 sections, 15 figures, and 7 tables.

Figures (15)

  • Figure 1: SANPO is the only human-egocentric dataset with panoptic masks, multi-view stereo, depth, camera pose, and both real and synthetic data. SANPO has the largest number of panoptic frames among related work and a respectable number of depth annotations. (Note: $^1$: multi-view, $^2$: partial coverage, $^3$: sparse depth, $^4$: sparse segmentation)
  • Figure 2: SANPO Real Sample. Top row shows a stereo left frame from a session along with its metric depth and segmentation annotations. Bottom row shows the 3D scene of the session built using the annotations we provide. Points from several seconds of video are accumulated and aligned with ICP.
  • Figure 3: SANPO-Real environment diversity, showing the distribution of video-level annotations for 11 of the 12 attributes. Each pie chart shows that annotation over all 701 sessions.
  • Figure 4: SANPO-Synthetic Sample. Right column shows a single frame from a synthetic session along with its metric depth and segmentation annotation. Left column shows the 3D scene of the session built using the annotations. Points come from the accumulated depth maps and camera locations across many frames.
  • Figure 5: Semantic label occurrences in SANPO. Labels specific to human navigation, such as Building, Obstacle, Pole, Tree, Curb, and Sidewalk, feature more prominently.
  • ...and 10 more figures
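
The captions for Figures 2 and 4 describe 3D scenes built by accumulating points from depth maps and camera locations across frames. A minimal sketch of that idea follows, assuming a pinhole camera model and known camera-to-world poses. The function names, the intrinsics parameters (fx, fy, cx, cy) and the nested-list data layout are hypothetical and are not taken from the SANPO tooling. The ICP alignment mentioned for SANPO-Real is not implemented here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Matrix = Sequence[Sequence[float]]


def backproject(depth: Matrix, fx: float, fy: float,
                cx: float, cy: float) -> List[Point]:
    """Lift a metric depth map (rows of depths in meters) to camera-frame points.

    Assumes a pinhole camera; non-positive depths are treated as missing.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a valid depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def to_world(points: Sequence[Point], pose: Matrix) -> List[Point]:
    """Apply a 4x4 camera-to-world pose (row-major nested lists) to each point."""
    return [
        tuple(pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
              for r in range(3))
        for x, y, z in points
    ]


def accumulate(frames: Sequence[Tuple[Matrix, Matrix]], fx: float, fy: float,
               cx: float, cy: float) -> List[Point]:
    """Merge the (depth, pose) pairs of several frames into one world-frame cloud."""
    cloud: List[Point] = []
    for depth, pose in frames:
        cloud.extend(to_world(backproject(depth, fx, fy, cx, cy), pose))
    return cloud
```

Calling `accumulate` on a list of `(depth, pose)` pairs returns a single list of world-frame points. When poses are noisy, as they can be for real captures, a registration step such as ICP would be applied on top of this to refine the alignment.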