Floor Plan-Guided Visual Navigation Incorporating Depth and Directional Cues

Wei Huang, Jiaxin Li, Zang Wan, Huijun Di, Wei Liang, Zhu Yang

TL;DR

A novel diffusion-based policy, denoted as GlocDiff, which integrates global path planning from the floor plan with local depth-aware features derived from RGB observations, collectively enabling precise prediction of optimal navigation directions and robust obstacle avoidance.

Abstract

Guiding an agent to a specific target in indoor environments based solely on RGB inputs and a floor plan is a promising yet challenging problem. Although existing methods have made significant progress, two challenges remain unresolved. First, the modality gap between egocentric RGB observations and the floor plan hinders the integration of visual and spatial information for both local obstacle avoidance and global planning. Second, accurate localization is critical for navigation performance, but remains challenging at deployment in unseen environments due to the lack of explicit geometric alignment between RGB inputs and floor plans. We propose a novel diffusion-based policy, denoted as GlocDiff, which integrates global path planning from the floor plan with local depth-aware features derived from RGB observations. The floor plan offers explicit global guidance, while the depth features provide implicit geometric cues, collectively enabling precise prediction of optimal navigation directions and robust obstacle avoidance. Moreover, GlocDiff introduces noise perturbation during training to enhance robustness against pose estimation errors, and we find that combining this with a relatively stable visual odometry (VO) module during inference significantly improves navigation performance. Extensive experiments on the FloNa benchmark demonstrate GlocDiff's efficiency and effectiveness in achieving superior navigation performance, and successful real-world deployments highlight its potential for widespread practical applications.
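The noise perturbation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the noise model, so the zero-mean Gaussian noise, the `(x, y, theta)` pose layout, the function name `perturb_pose`, and the default standard deviations below are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def perturb_pose(pose, sigma_xy=0.1, sigma_theta=0.05, rng=random):
    """Return a noisy copy of a planar pose (x, y, theta).

    Assumption: zero-mean Gaussian noise on position and heading.
    The sigma values are placeholders, not values from the paper.
    """
    x, y, theta = pose
    x += rng.gauss(0.0, sigma_xy)
    y += rng.gauss(0.0, sigma_xy)
    theta += rng.gauss(0.0, sigma_theta)
    # Wrap the heading back into [-pi, pi).
    theta = (theta + math.pi) % (2.0 * math.pi) - math.pi
    return (x, y, theta)


# Hypothetical usage: perturb the ground-truth pose before it is fed
# to the policy during training, so the policy sees imperfect poses.
rng = random.Random(0)
noisy = perturb_pose((1.0, 2.0, 0.5), rng=rng)
```

Training on perturbed poses exposes the policy to the kind of localization error it may meet at inference time, which is the stated motivation for the perturbation.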

Paper Structure

This paper contains 27 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: At each time step, our method extracts the depth cue from the observation frames and derives the directional cue by planning the shortest path to the goal on the floor plan. Both complementary cues facilitate the learning of a navigation policy that integrates obstacle avoidance with efficient goal-reaching behavior.
  • Figure 2: Overview of GlocDiff. The depth encoder uses RGB observations to produce depth latent features, which are fed into a multi-head attention module, yielding the depth context. Taking the depth context, the current pose from the localization module, and the floor plan feature from the floor plan encoder as inputs, the MLP outputs the depth cue. The directional cue is obtained by computing the shortest path to the goal using the A$^*$ algorithm. Conditioned on depth and directional cues, the diffusion-based policy generates future actions.
  • Figure 3: Conditional UNet. The directional cue and the time step are fed into the first down module and the second up module, while the depth cue and the time step are fed into each module.
  • Figure 4: Trajectory visualization. The visualization of trajectories for four baselines across six episodes.
  • Figure 5: Performance of different baselines on the test episodes in Mobridge (left) and Spotswood (right). The left column illustrates the traversed trajectory, while the right column showcases the diverse actions generated by different models based on the same observation.
  • ...and 2 more figures
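
The Figure 2 caption states that the directional cue is obtained by computing the shortest path to the goal with the A* algorithm. The sketch below shows one way this could look. The occupancy-grid representation of the floor plan, the 4-connected moves, the Manhattan heuristic, the `lookahead` parameter, and the function names are all assumptions for illustration; the captions do not describe how the path is converted into a cue.

```python
import heapq
import math


def astar(grid, start, goal):
    """Shortest 4-connected path on an occupancy grid (0 = free, 1 = wall).

    Returns a list of (row, col) cells from start to goal, or None if
    the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    g_cost = {start: 0}
    parent = {}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if g > g_cost[cell]:
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < g_cost.get(nxt, math.inf):
                    g_cost[nxt] = new_g
                    parent[nxt] = cell
                    heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None


def directional_cue(path, lookahead=3):
    """Unit vector (d_row, d_col) from the first cell toward a waypoint.

    Assumption: the cue points at the cell `lookahead` steps along the path.
    """
    if path is None or len(path) < 2:
        return (0.0, 0.0)
    target = path[min(lookahead, len(path) - 1)]
    d_row = target[0] - path[0][0]
    d_col = target[1] - path[0][1]
    norm = math.hypot(d_row, d_col)
    return (d_row / norm, d_col / norm)


# Toy floor plan: a wall separates the top and bottom corridors.
grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
cue = directional_cue(path)  # (0.0, 1.0): head along the top corridor
```

In this toy example the goal lies directly below the start, yet the cue points sideways along the corridor, which shows why a planned path carries more information than the straight-line bearing to the goal.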