Efficient Navigation in Unknown Indoor Environments with Vision-Language Models

D. Schwartz, K. Kondo, J. P. How

TL;DR

This work tackles efficient navigation in unknown indoor environments with many dead ends by introducing a vision-language-model (VLM)-based high-level planner that reasons about partial occupancy maps. The planner projects a 3D occupancy grid into a 2D map and generates frontier-based subgoals; the VLM then scores these subgoals and guides a traditional trajectory planner (DYNUS) away from detours into small rooms. Key contributions include frontier clustering for subgoal generation, median-based aggregation of the VLM's scores with the median absolute deviation (MAD) as a robustness measure, and path-length reductions of about 10% on average in Gazebo simulations. The approach requires modest prior knowledge and integrates cleanly with existing autonomy stacks, improving routing under partial observability. Model latency remains a practical bottleneck, and future work could add region-level semantics for richer planning.

Abstract

We present a novel high-level planning framework that leverages vision-language models (VLMs) to improve autonomous navigation in unknown indoor environments with many dead ends. Traditional exploration methods often take inefficient routes due to limited global reasoning and reliance on local heuristics. In contrast, our approach enables a VLM to reason directly about occupancy maps in a zero-shot manner, selecting subgoals that are likely to yield more efficient paths. At each planning step, we convert a 3D occupancy grid into a partial 2D map of the environment, and generate candidate subgoals. Each subgoal is then evaluated and ranked against other candidates by the model. We integrate this planning scheme into DYNUS \cite{kondo2025dynus}, a state-of-the-art trajectory planner, and demonstrate improved navigation efficiency in simulation. The VLM infers structural patterns (e.g., rooms, corridors) from incomplete maps and balances the need to make progress toward a goal against the risk of entering unknown space. This reduces common greedy failures (e.g., detouring into small rooms) and achieves about 10% shorter paths on average.
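
The map-processing step described above (projecting the 3D occupancy grid to 2D, then generating candidate subgoals) can be illustrated with a minimal sketch. It is not the authors' implementation: the cell encoding (`FREE`, `OCCUPIED`, `UNKNOWN`), the column-collapse rule, the 4-connected frontier definition, and the one-centroid-per-cluster subgoal are all assumptions chosen for illustration, and every function name is hypothetical.

```python
from collections import deque

# Hypothetical cell encoding; the paper does not specify one.
FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def project_to_2d(grid3d):
    """Collapse grid3d[z][y][x] into grid2d[y][x].

    Assumed rule: a column is occupied if any cell in it is occupied,
    free if it has at least one free cell and no occupied cell,
    and unknown otherwise.
    """
    depth = len(grid3d)
    height, width = len(grid3d[0]), len(grid3d[0][0])
    grid2d = [[UNKNOWN] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            column = [grid3d[z][y][x] for z in range(depth)]
            if OCCUPIED in column:
                grid2d[y][x] = OCCUPIED
            elif FREE in column:
                grid2d[y][x] = FREE
    return grid2d


def neighbors4(y, x, height, width):
    """Yield the in-bounds 4-connected neighbors of cell (y, x)."""
    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ny, nx = y + dy, x + dx
        if 0 <= ny < height and 0 <= nx < width:
            yield ny, nx


def frontier_cells(grid2d):
    """Return the set of free cells that touch at least one unknown cell."""
    height, width = len(grid2d), len(grid2d[0])
    return {
        (y, x)
        for y in range(height)
        for x in range(width)
        if grid2d[y][x] == FREE
        and any(
            grid2d[ny][nx] == UNKNOWN
            for ny, nx in neighbors4(y, x, height, width)
        )
    }


def cluster_frontiers(cells, height, width):
    """Group frontier cells into 4-connected clusters.

    Returns one (row, col) centroid per cluster, sorted for determinism.
    Each centroid stands in for one candidate subgoal.
    """
    remaining = set(cells)
    centroids = []
    while remaining:
        seed = remaining.pop()
        queue, cluster = deque([seed]), [seed]
        while queue:  # breadth-first flood fill over frontier cells
            y, x = queue.popleft()
            for cell in neighbors4(y, x, height, width):
                if cell in remaining:
                    remaining.remove(cell)
                    queue.append(cell)
                    cluster.append(cell)
        cy = sum(c[0] for c in cluster) / len(cluster)
        cx = sum(c[1] for c in cluster) / len(cluster)
        centroids.append((cy, cx))
    return sorted(centroids)
```

Calling `project_to_2d`, then `frontier_cells`, then `cluster_frontiers` yields a short list of candidate positions that could be drawn onto the map image and passed to the model for ranking. A real system would likely also filter clusters by size and snap centroids to reachable free cells; both are omitted here.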

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of the main steps of the planning workflow.
  • Figure 2: Image-based representation of the environment. The robot location is marked in red, the goal in green, candidate subgoals in yellow, free space in light gray, unknown space in dark gray, and occupied space in black.
  • Figure 3: Comparison of 15 paths produced by DYNUS \cite{kondo2025dynus} with and without our proposed high-level planner in a Gazebo office environment. The start is at the bottom-left and the goal is at the top-right. Path color encodes speed (warmer colors indicate higher speed). The maximum velocity is set to 1.0 m/s. Fig. \ref{fig:dynus_office} shows that DYNUS alone often enters small rooms and backtracks when the goal is still far, leading to inefficient paths. Fig. \ref{fig:vlm_office} shows that our method stays in hallways and avoids small rooms when far from the goal, yielding shorter paths.
  • Figure S1: View of grid-world, a simulation tool we built to study the navigation capabilities of LLMs/VLMs. The 2D map on the left shows the robot (red) traversing a corridor in an effort to reach the final goal (green) in the top right. The squares in the middle and on the right display the normalized median belief and the median absolute deviation (MAD) of the belief, respectively, for each candidate subgoal the robot can move to. The implementation of grid-world is also included in the provided code.
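
The normalized median belief and MAD shown per candidate subgoal in Figure S1 can be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' code: it assumes the model is queried several times and returns a non-negative numeric score per subgoal each time, it normalizes medians so they sum to 1, and its selection rule (highest normalized median, ties broken by lower MAD) is hypothetical. The function names are illustrative.

```python
from statistics import median


def aggregate_beliefs(samples):
    """Summarize repeated scores for each candidate subgoal.

    samples maps a subgoal id to a list of non-negative scores, one per
    query. Returns, per subgoal, the median score, the median absolute
    deviation (MAD) around that median, and the median normalized so
    that all subgoals' normalized medians sum to 1.
    """
    stats = {}
    for subgoal, scores in samples.items():
        med = median(scores)
        mad = median(abs(score - med) for score in scores)
        stats[subgoal] = {"median": med, "mad": mad}
    total = sum(entry["median"] for entry in stats.values())
    for entry in stats.values():
        entry["normalized_median"] = entry["median"] / total if total > 0 else 0.0
    return stats


def select_subgoal(stats):
    """Assumed rule: highest normalized median; lower MAD breaks ties."""
    return max(
        stats,
        key=lambda s: (stats[s]["normalized_median"], -stats[s]["mad"]),
    )
```

The median is used because a single outlying response barely moves it: with scores `[8, 7, 9, 1, 8]` the median stays at 8 even though one query returned 1, whereas the mean would drop to 6.6. The MAD then reports how much the responses disagree, which is the role of the right-hand panel in the figure.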