Bootstrap Perception Under Hardware Depth Failure for Indoor Robot Navigation

Nishant Pushparaju, Vivek Mattam, Aliasghar Arab

Abstract

We present a bootstrap perception system for indoor robot navigation under hardware depth failure. In our corridor data, the time-of-flight camera loses up to 78% of its depth pixels on reflective surfaces, yet a 2D LiDAR alone cannot sense obstacles above its scan plane. Our system exploits a self-referential property of this failure: the sensor's surviving valid pixels calibrate learned monocular depth to metric scale, so the system fills its own gaps without external data. The architecture forms a failure-aware sensing hierarchy, conservative when sensors work and filling in when they fail: LiDAR remains the geometric anchor, hardware depth is kept where valid, and learned depth enters only where needed. In corridor and dynamic pedestrian evaluations, selective fusion increases costmap obstacle coverage by 55-110% over LiDAR alone. A compact distilled student runs at 218 FPS on a Jetson Orin Nano and achieves 9/10 navigation success with zero collisions in closed-loop simulation, matching the ground-truth depth baseline at a fraction of the foundation model's cost.
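
The self-referential calibration described above reduces to a per-frame affine fit: the monocular network's relative depth is aligned to metric scale using only the ToF pixels that survived the failure. The sketch below is illustrative only, assuming an affine-invariant monocular output and a boolean validity mask; the function name, threshold, and variable names are ours, not taken from the paper.

```python
import numpy as np

def calibrate_mono_depth(mono_rel, tof_depth, tof_valid, min_valid=500):
    """Fit a per-frame scale/shift mapping relative monocular depth to metric
    depth, using only the ToF pixels that survived the failure.

    mono_rel  : HxW float array, relative depth from the monocular network
    tof_depth : HxW float array, metric ToF depth (zero where invalid)
    tof_valid : HxW bool array, True where the ToF reading is trusted
    Returns (scale, shift), or None if too few valid pixels remain.
    """
    m = mono_rel[tof_valid]
    t = tof_depth[tof_valid]
    if m.size < min_valid:
        return None  # not enough surviving pixels to anchor the scale
    # Least-squares fit t ~ scale * m + shift over the valid pixels only.
    A = np.stack([m, np.ones_like(m)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, t, rcond=None)
    return scale, shift
```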

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Depth fusion under worst-case ToF failure (20% valid pixels). The hardware sensor retains only a sparse stippled pattern; DA3-Small supplies dense depth estimates in the regions where the hardware sensor fails. Even at 80% pixel failure, the fused output combines surviving hardware depth with learned depth to produce a dense depth estimate for costmap construction.
  • Figure 2: System architecture. LiDAR remains the primary geometric reference, hardware depth is used where valid, learned monocular depth replaces unreliable active-depth regions, and optional semantics refine obstacle footprint only when needed. Heavy teachers remain off-board. Dashed orange paths are optional.
  • Figure 3: Depth fusion on a representative corridor frame (43% valid ToF pixels). Top: RGB input showing polished floor and glass doors, sparse sensor depth with large invalid regions, and DA3-Small depth prediction. Bottom: V9 corridor-specialist student, and both fused outputs (sensor+DA3, sensor+V9). Learned depth fills the floor and glass regions where the ToF sensor returns zero depth (a minimal fusion sketch follows after this list).
  • Figure 4: False-positive source decomposition (DA3-Small, n = 459 corridor frames). Sensor-invalid fill (34.6%) represents structure added in ToF dead-pixel regions; hallucination (49.3%) is free-space false occupancy; inflation artifacts (18.1%) arise from Nav2 inflation of false cells.
  • Figure 5: Live Nav2 costmap during corridor replay. Left: hardware depth (top, mostly invalid) and RGB camera (bottom). Right: costmap overlay showing DA3-Small depth contributions (green) and V9 student depth (blue). Learned depth detects chairs and tables visible through the glass wall where the ToF sensor returns invalid depth.
  • ...and 1 more figure
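
The selective fusion shown in Figures 1 and 3 reduces, per pixel, to keeping hardware depth where the ToF reading is valid and filling the remaining pixels with metric-aligned learned depth. A minimal sketch follows, assuming the scale and shift come from the calibration step above; the clamp range and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fuse_depth(tof_depth, tof_valid, mono_rel, scale, shift, max_range=8.0):
    """Keep hardware depth wherever the ToF pixel is valid, and fill only the
    invalid regions with the bootstrap-calibrated learned depth."""
    mono_metric = scale * mono_rel + shift               # learned depth in metric units
    fused = np.where(tof_valid, tof_depth, mono_metric)  # valid pixels stay untouched
    # Clamp to a plausible sensing range before handing the frame to the costmap layer.
    return np.clip(fused, 0.0, max_range)
```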