NaVIP: An Image-Centric Indoor Navigation Solution for Visually Impaired People

Jun Yu, Yifan Zhang, Badrinadh Aila, Vinod Namboodiri

TL;DR

NaVIP is an image dataset and associated image-centric solution for visual intelligence that is infrastructure-free and task-scalable. It can assist VIPs in understanding their surroundings, and it validates the prospect of image-based solutions for indoor navigation.

Abstract

Indoor navigation is challenging due to the absence of satellite positioning. The challenge is many times greater for Visually Impaired People (VIPs), who cannot get information from wayfinding signage. Other sensor signals (e.g., Bluetooth and LiDAR) can be used to create turn-by-turn navigation solutions with position updates for users. Unfortunately, these solutions require either tags installed throughout the environment or fairly expensive hardware. Moreover, they require a high degree of manual involvement, which raises costs and hampers scalability. We propose an image dataset and associated image-centric solution called NaVIP for visual intelligence that is infrastructure-free and task-scalable, and that can assist VIPs in understanding their surroundings. Specifically, we start by curating large-scale phone camera data in a four-floor research building, with 300K images, to lay the foundation for an inclusive, image-centric indoor navigation and exploration solution. Every image is labelled with a precise 6DoF camera pose, details of indoor PoIs, and a descriptive caption to assist VIPs. We benchmark two main aspects, 1) the positioning system and 2) exploration support, prioritizing training scalability and real-time inference, to validate the prospect of image-based solutions for indoor navigation. The dataset, code, and model checkpoints are publicly available at https://github.com/junfish/VIP_Navi.

Paper Structure

This paper contains 36 sections, 2 equations, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: Illustration of the pipeline for purely image-based indoor navigation. We collect videos and extract image frames as data sources. Each image is annotated with: 1) a 6-DoF camera pose aligned with the floor plan, 2) indoor points-of-interest (PoIs), and 3) visual descriptions that assist visually impaired people (VIPs) in understanding their surroundings. We highlight the task scalability of this solution, facilitated by end-to-end training and inference using a simple image forward pass.
  • Figure 2: This illustration outlines the steps for converting camera poses from initial coordinates to a unified floor plan world coordinate system: 1) Independent 3D scene construction for images from each video using COLMAP, executed in parallel; 2) Pinpointing several anchor images from each video to the floor plan and geo-registering camera poses using these anchors; 3) Aligning and validating the entire path onto the floor plan; 4) Combining all images from different paths for unified training purposes.
  • Figure 3: Floor-plan-dependent PoIs can be reported at the same time as the camera pose is determined.
  • Figure 4: CDF of errors on MS-Transformer.
  • Figure 5: Movies grabbed from the 3D reconstruction of the project 20231220_141254_proj/.
  • ...and 8 more figures
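
Step 2 of the Figure 2 caption (geo-registering camera poses from a few anchor images pinned to the floor plan) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that camera positions have already been projected onto a 2D plane and that a single similarity transform (scale, rotation, translation) per video is sufficient. The function names `fit_similarity_2d` and `apply_similarity_2d` and the anchor coordinates are hypothetical, introduced only for illustration.

```python
def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity transform mapping src anchors onto dst anchors.

    Points are treated as complex numbers z = x + iy, so the transform is
    w = a * z + b, where |a| is the scale, the argument of a is the rotation
    angle, and b is the translation. Reflections are not modelled (assumption).
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matching anchor pairs")
    z = [complex(x, y) for x, y in src]
    w = [complex(x, y) for x, y in dst]
    z_mean = sum(z) / len(z)
    w_mean = sum(w) / len(w)
    # Closed-form least-squares solution on the centred points.
    num = sum((wi - w_mean) * (zi - z_mean).conjugate() for zi, wi in zip(z, w))
    den = sum(abs(zi - z_mean) ** 2 for zi in z)
    if den == 0:
        raise ValueError("source anchors must not all coincide")
    a = num / den
    b = w_mean - a * z_mean
    return a, b


def apply_similarity_2d(a, b, points):
    """Map (x, y) points from reconstruction coordinates to floor-plan coordinates."""
    mapped = []
    for x, y in points:
        p = a * complex(x, y) + b
        mapped.append((p.real, p.imag))
    return mapped


# Hypothetical anchors: positions in one video's reconstruction frame and
# the matching positions pinned by hand on the floor plan.
anchors_recon = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
anchors_plan = [(10.0, 5.0), (10.0, 7.0), (8.0, 5.0)]

a, b = fit_similarity_2d(anchors_recon, anchors_plan)
path_on_plan = apply_similarity_2d(a, b, [(0.5, 0.5), (2.0, 1.0)])
```

Once `a` and `b` are fitted from the anchors, every other camera position from the same video can be mapped with `apply_similarity_2d`, which loosely corresponds to aligning the entire path onto the floor plan in step 3. A full 6-DoF registration would also need to transform the camera orientations and the vertical axis, which this 2D sketch leaves out.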