ShelfAware: Real-Time Visual-Inertial Semantic Localization in Quasi-Static Environments with Low-Cost Sensors

Shivendra Agrawal, Jake Brawer, Ashutosh Naik, Alessandro Roncone, Bradley Hayes

TL;DR

Quasi-static indoor spaces pose severe localization challenges due to repetitive geometry and drifting local semantics. ShelfAware couples a depth-based geometry model with a distributional semantic representation of object categories and leverages an offline/online inverse semantic model to propose high-quality pose hypotheses, enabling rapid global localization on low-cost vision hardware. The approach is implemented as a semantic particle filter that fuses depth likelihoods with a semantic similarity score and uses a precomputed semantic-view bank for fast localization, validated in a mock grocery store with wearable and cart-mounted configurations. Results show 96% global-localization success, fast convergence (mean ~1.91 s), and robust tracking across dynamic occlusions and sparse semantics, outperforming MCL and AMCL while running in real time on a laptop.

Abstract

Many indoor workspaces are quasi-static: the global layout is stable but local semantics change continually, producing repetitive geometry, dynamic clutter, and perceptual noise that defeat vision-based localization. We present ShelfAware, a semantic particle filter for robust global localization that treats scene semantics as statistical evidence over object categories rather than fixed landmarks. ShelfAware fuses a depth likelihood with a category-centric semantic similarity and uses a precomputed bank of semantic viewpoints to perform inverse semantic proposals inside MCL, yielding fast, targeted hypothesis generation on low-cost, vision-only hardware. Across 100 global-localization trials spanning four conditions (cart-mounted, wearable, dynamic obstacles, and sparse semantics) in a semantically dense retail environment, ShelfAware achieves a 96% success rate (vs. 22% for MCL and 10% for AMCL) with a mean time-to-convergence of 1.91 s, attains the lowest translational RMSE in all conditions, and maintains stable tracking in 80% of tested sequences, all while running in real time on a consumer laptop-class platform. By modeling semantics distributionally at the category level and leveraging inverse proposals, ShelfAware resolves the geometric aliasing and semantic drift common to quasi-static domains. Because the method requires only vision sensors and VIO, it integrates as an infrastructure-free building block for mobile robots in warehouses, labs, and retail settings; as a representative application, it also supports assistive devices that provide start-anytime, shared-control navigation for people with visual impairments.
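
The two mechanisms named in the abstract, fused depth and semantic weighting and inverse semantic proposals from a precomputed view bank, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the functional forms, so everything specific here is an assumption: the Gaussian depth term, the use of cosine similarity over class counts, the parameters `sigma`, `alpha`, `k`, and `pos_noise`, and the dictionary layouts for particles and bank entries.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity of two class-count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fused_weights(particles, observed_depth, observed_counts, sigma=0.2, alpha=5.0):
    """Return normalized particle weights from depth and semantic evidence.

    ASSUMPTION: weight = Gaussian depth likelihood * exp(alpha * (cosine - 1)).
    The paper's actual likelihoods may differ.
    """
    raw = []
    for p in particles:
        sq_err = sum((e - o) ** 2 for e, o in zip(p["expected_depth"], observed_depth))
        w_depth = math.exp(-sq_err / (2.0 * sigma ** 2))
        sim = cosine_similarity(p["expected_counts"], observed_counts)
        w_sem = math.exp(alpha * (sim - 1.0))
        raw.append(w_depth * w_sem)
    total = sum(raw)
    if total == 0.0:
        # All weights underflowed: fall back to a uniform distribution.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]


def propose_poses(view_bank, observed_counts, k=1, n_samples=10, pos_noise=0.1, rng=None):
    """Sample (x, y, yaw) hypotheses near the k bank views most similar to the observation.

    ASSUMPTION: each bank entry is {"pose": (x, y, yaw), "counts": [...]}.
    """
    rng = rng or random.Random()
    ranked = sorted(
        view_bank,
        key=lambda entry: cosine_similarity(entry["counts"], observed_counts),
        reverse=True,
    )
    top = ranked[:k]
    proposals = []
    for _ in range(n_samples):
        x, y, yaw = rng.choice(top)["pose"]
        # Jitter position only; yaw is kept from the stored viewpoint.
        proposals.append((x + rng.gauss(0.0, pos_noise), y + rng.gauss(0.0, pos_noise), yaw))
    return proposals
```

In this sketch, two particles in geometrically identical aisles receive the same depth term, so only the semantic term separates them; `propose_poses` plays the role of the inverse model by turning an observation directly into candidate poses rather than scoring existing ones.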

Paper Structure

This paper contains 23 sections, 7 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: An overview of ShelfAware. A) A mock grocery environment used for evaluation, where semantic observations are obtained via a chest-mounted camera system. B) Depth-based observation models in particle filtering rely solely on geometric features, which are ambiguous in long, repetitive aisles and lead to weak particle discrimination. C) ShelfAware injects particles based on semantic cues; combined with the depth observation model, this yields more distinctive and robust particle weighting and improves global localization in retail-like environments.
  • Figure 2: 3D semantic map overlaid on the 2D occupancy grid. Each voxel stores a distribution over object class counts. Ray casting on this semantic layer yields the expected semantic vector $\mathbf{v}_{\text{sem}}$ comprising class counts, distances, and angles.
  • Figure 3: Semantic vector $\mathbf{v}_{\text{sem}} = [\mathbf{v}_c, \mathbf{v}_\theta, \mathbf{v}_d]$. The count vector $\mathbf{v}_c$ captures the number of items detected in each class at a given pose; $\mathbf{v}_\theta$ and $\mathbf{v}_d$ capture mean relative bearings and ranges for each visible class.
  • Figure 4: Data-flow diagram for ShelfAware. The semantic particle filter fuses depth likelihood with semantic likelihood and uses an inverse semantic model to propose high-quality particles for global localization and recovery.
  • Figure 5: ShelfAware hardware. A lightweight two-camera system with a 3D-printed mount was used throughout our experiments (top). This design allowed evaluation across a wearable chest mount (left) and a cart-mounted setup (right).
  • ...and 3 more figures
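
The semantic vector described in the Figure 3 caption can be assembled as in the following Python sketch. The layout $[\mathbf{v}_c, \mathbf{v}_\theta, \mathbf{v}_d]$ follows the caption; the rest is assumed for illustration: the detection format `(class_id, bearing, range)`, the use of `0.0` as a placeholder for classes that are not visible, and the plain arithmetic mean of bearings (reasonable only when bearings stay within a camera field of view and do not wrap around).

```python
def semantic_vector(detections, num_classes):
    """Build v_sem = [v_c, v_theta, v_d] as one flat list.

    detections: iterable of (class_id, bearing_rad, range_m) tuples
                (hypothetical format, not taken from the paper).
    num_classes: number of object categories in the map.
    """
    counts = [0] * num_classes
    bearing_sum = [0.0] * num_classes
    range_sum = [0.0] * num_classes

    for class_id, bearing, rng in detections:
        counts[class_id] += 1
        bearing_sum[class_id] += bearing
        range_sum[class_id] += rng

    # Mean bearing and range per visible class; 0.0 where the class is absent
    # (ASSUMPTION: the paper may encode absent classes differently).
    v_theta = [bearing_sum[k] / counts[k] if counts[k] else 0.0 for k in range(num_classes)]
    v_d = [range_sum[k] / counts[k] if counts[k] else 0.0 for k in range(num_classes)]

    return counts + v_theta + v_d
```

For three classes, the result has nine entries: three counts, three mean bearings, and three mean ranges. The same function could be applied both to live detections and to ray-cast results from the semantic map, so that observed and expected vectors are directly comparable.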