Efficient Navigation in Unknown Indoor Environments with Vision-Language Models
D. Schwartz, K. Kondo, J. P. How
TL;DR
This work tackles efficient navigation in unknown indoor environments with many dead ends by introducing a vision-language-model (VLM)-based high-level planner that reasons about partial occupancy maps. The planner projects a 3D occupancy grid into a 2D map and generates frontier-based subgoals; the VLM then scores these candidates and guides a traditional trajectory planner (DYNUS) away from detours into small rooms. Key contributions include frontier clustering for subgoal generation, median-based score aggregation with the median absolute deviation (MAD) for robustness, and roughly 10% shorter paths on average in Gazebo simulations. The approach requires only modest prior knowledge and integrates cleanly with existing autonomy stacks, improving routing under partial observability. Model latency remains a practical bottleneck, and future work could add region-level semantics for richer planning.
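As a rough illustration of the median-based aggregation with MAD mentioned above, the sketch below assumes the VLM is queried several times per subgoal and returns one numeric score per query. The function names, the integer score scale, and the rule of subtracting a weighted MAD from the median are illustrative assumptions; the summary does not state how the authors combine the two statistics.

```python
from statistics import median


def aggregate_scores(samples):
    """Return (median, MAD) for repeated scores of a single subgoal."""
    med = median(samples)
    # Median absolute deviation: the median distance of samples from the median.
    mad = median(abs(s - med) for s in samples)
    return med, mad


def pick_subgoal(scores_by_subgoal, mad_weight=1.0):
    """Pick the subgoal with the highest median score minus a spread penalty.

    The penalty form (median - mad_weight * MAD) is an assumption made for
    this sketch, not the rule used in the paper.
    """
    best_id, best_value = None, float("-inf")
    for subgoal_id, samples in scores_by_subgoal.items():
        med, mad = aggregate_scores(samples)
        value = med - mad_weight * mad
        if value > best_value:
            best_id, best_value = subgoal_id, value
    return best_id


# Hypothetical scores on a 0-10 scale from three VLM queries per subgoal.
scores = {"A": [7, 7, 8], "B": [8, 2, 10]}
print(pick_subgoal(scores))                  # consistent scores are preferred
print(pick_subgoal(scores, mad_weight=0.0))  # plain median comparison
```

With the penalty enabled, a subgoal whose scores agree across queries can outrank one with a higher median but inconsistent scores; setting `mad_weight=0.0` reduces the rule to a plain median comparison.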
Abstract
We present a novel high-level planning framework that leverages vision-language models (VLMs) to improve autonomous navigation in unknown indoor environments with many dead ends. Traditional exploration methods often take inefficient routes due to limited global reasoning and reliance on local heuristics. In contrast, our approach enables a VLM to reason directly about occupancy maps in a zero-shot manner, selecting subgoals that are likely to yield more efficient paths. At each planning step, we convert a 3D occupancy grid into a partial 2D map of the environment, and generate candidate subgoals. Each subgoal is then evaluated and ranked against other candidates by the model. We integrate this planning scheme into DYNUS \cite{kondo2025dynus}, a state-of-the-art trajectory planner, and demonstrate improved navigation efficiency in simulation. The VLM infers structural patterns (e.g., rooms, corridors) from incomplete maps and balances the need to make progress toward a goal against the risk of entering unknown space. This reduces common greedy failures (e.g., detouring into small rooms) and achieves about 10\% shorter paths on average.
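The map-processing step described above (3D grid to partial 2D map, then candidate subgoals) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the cell encoding, the projection rule (a column is occupied if any cell in it is occupied), the textbook frontier definition (a free cell next to an unknown cell), and connected-component clustering with centroids as subgoals are all choices made here for illustration.

```python
from collections import deque

# Assumed cell encoding for this sketch.
UNKNOWN, FREE, OCCUPIED = -1, 0, 1


def project_to_2d(grid3d):
    """Collapse grid3d[z][y][x] along z into a 2D map indexed as [y][x]."""
    height, width = len(grid3d[0]), len(grid3d[0][0])
    grid2d = [[UNKNOWN] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            column = [layer[y][x] for layer in grid3d]
            if OCCUPIED in column:
                grid2d[y][x] = OCCUPIED
            elif FREE in column:
                grid2d[y][x] = FREE
    return grid2d


def frontier_cells(grid2d):
    """Return free cells that have at least one 4-connected unknown neighbor."""
    height, width = len(grid2d), len(grid2d[0])
    cells = set()
    for y in range(height):
        for x in range(width):
            if grid2d[y][x] != FREE:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                in_bounds = 0 <= ny < height and 0 <= nx < width
                if in_bounds and grid2d[ny][nx] == UNKNOWN:
                    cells.add((y, x))
                    break
    return cells


def cluster_frontiers(cells):
    """Group 8-connected frontier cells; return one centroid (y, x) per cluster."""
    remaining = set(cells)
    centroids = []
    while remaining:
        seed = remaining.pop()
        queue, cluster = deque([seed]), [seed]
        while queue:  # breadth-first search over neighboring frontier cells
            y, x = queue.popleft()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    neighbor = (y + dy, x + dx)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        queue.append(neighbor)
                        cluster.append(neighbor)
        cy = sum(c[0] for c in cluster) / len(cluster)
        cx = sum(c[1] for c in cluster) / len(cluster)
        centroids.append((cy, cx))
    return sorted(centroids)


# Toy corridor: one opening to unknown space below and one at the right end.
layer = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, -1],
    [1, -1, 1, 1, 1, 1, 1],
]
grid2d = project_to_2d([layer])
print(cluster_frontiers(frontier_cells(grid2d)))
```

In a full pipeline, each centroid would be drawn on the 2D map image and passed to the VLM for scoring; that step is omitted here because it depends on the specific model interface.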
