FloNa: Floor Plan Guided Embodied Visual Navigation

Jiaxin Li, Weiqi Huang, Zan Wang, Wei Liang, Huijun Di, Feng Liu

TL;DR

FloNa introduces a floor-plan guided embodied visual navigation task and proposes FloDiff, a diffusion-policy framework with an explicit localization module that aligns RGB observations with a floor plan. The method fuses observed RGB sequences and floor-plan embeddings with a transformer, comes in two variants (Naive-FloDiff and Loc-FloDiff) that differ in how the agent pose is obtained, and is trained with losses that jointly optimize diffusion action prediction and pose/distance-to-goal estimation. A large iGibson-based dataset of $20{,}214$ episodes across $117$ scenes, totaling approximately $3.31$ million RGB images, is curated to benchmark performance. Evaluation shows that Loc-FloDiff, especially with ground-truth pose, achieves superior success rate (SR) and success weighted by path length (SPL) and is robust to localization noise. Real-world deployment on an AGV without fine-tuning further demonstrates robustness and practical potential in unseen environments. Overall, the study highlights the viability and benefits of incorporating floor-plan priors into embodied navigation and lays a foundation for future multi-modal, localization-aware planning in indoor settings.
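The two reported metrics can be illustrated with a minimal sketch. It assumes the standard definitions from the embodied-navigation literature: SR is the fraction of successful episodes, and SPL averages, over all episodes, the success indicator multiplied by the ratio of the shortest-path length to the larger of the shortest-path length and the length actually traveled. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by path length (standard definition, assumed here).

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length and p is the length the agent actually traveled.
    Failed episodes contribute 0. Assumes every l is strictly positive.
    """
    total = 0.0
    for ok, l, p in zip(successes, shortest_lengths, path_lengths):
        if ok:
            total += l / max(p, l)
    return total / len(successes)


# Hypothetical episodes: one optimal success, one failure, one detour success.
episode_success = [True, False, True]
episode_shortest = [10.0, 5.0, 8.0]
episode_traveled = [10.0, 7.0, 16.0]

sr_value = success_rate(episode_success)
spl_value = spl(episode_success, episode_shortest, episode_traveled)
```

SPL is never larger than SR, because each successful episode contributes at most 1; a large gap between the two indicates that the agent reaches goals but takes long detours.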

Abstract

Humans naturally rely on floor plans to navigate in unfamiliar environments, as they are readily available, reliable, and provide rich geometrical guidance. However, existing visual navigation settings overlook this valuable prior knowledge, leading to limited efficiency and accuracy. To bridge this gap, we introduce a novel navigation task: Floor Plan Visual Navigation (FloNa), the first attempt to incorporate floor plans into embodied visual navigation. While the floor plan offers significant advantages, two key challenges emerge: (1) handling the spatial inconsistency between the floor plan and the actual scene layout for collision-free navigation, and (2) aligning observed images with the floor plan sketch despite their distinct modalities. To address these challenges, we propose FloDiff, a novel diffusion policy framework incorporating a localization module to facilitate alignment between the current observation and the floor plan. We further collect $20k$ navigation episodes across $117$ scenes in the iGibson simulator to support training and evaluation. Extensive experiments demonstrate the effectiveness and efficiency of our framework in unfamiliar scenes using floor plan knowledge. Project website: https://gauleejx.github.io/flona/.

Paper Structure

This paper contains 40 sections, 4 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: FloNa: Given a floor plan with a marked goal indicated by the red dot, the agent's task is to navigate to the corresponding target location in the environment using RGB observations. To tackle this task, we propose FloDiff, a novel diffusion policy-based framework that iteratively generates and refines the planned trajectory.
  • Figure 2: Typical scenes, navigable areas, and navigation episodes in our collected dataset. Short and long episodes are shown in green and blue, respectively.
  • Figure 3: Pipeline overview. FloDiff employs an attention module to fuse features from visual observation and floor plan, yielding a context embedding $c_t$. Depending on how the current agent pose is derived, FloDiff has two variants: (1) Naive-FloDiff (below), which learns to predict the current pose $(\hat{p}_t, \hat{r}_t)$ during policy learning; (2) Loc-FloDiff (above), which directly uses the ground truth pose or predictions from pre-trained models. The concatenation of the observation context $c_t$, goal position $p_g$, and current agent pose $(p_t, r_t)$ is then fed into the policy network to generate actions.
  • Figure 4: Robustness of Loc-FloDiff (GT). (a) The blue arrow is the ground truth agent pose, and the yellow circle indicates the noisy poses. (b) Our method can generate diverse collision-free paths (in yellow), even given the noisy poses. The red paths, which result in collisions, are generated by Loc-A* (GT).
  • Figure 5: Agent behavior varies given different goals. (a) The agent starts from the same position but with three different goals. (b) Our approach predicts three distinct paths, each corresponding to a specific goal, as indicated by the respective colors.
  • ...and 3 more figures
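
The conditioning step described in the Figure 3 caption, where the observation context $c_t$, goal position $p_g$, and agent pose $(p_t, r_t)$ are concatenated before entering the policy network, can be sketched as follows. This is an illustration only: the function name, the 2D positions, the scalar heading, and the tiny embedding size are assumptions made for the example, not details confirmed by the source.

```python
from typing import List, Sequence, Tuple


def build_policy_condition(
    c_t: Sequence[float],
    p_g: Tuple[float, float],
    p_t: Tuple[float, float],
    r_t: float,
) -> List[float]:
    """Concatenate context, goal, and pose into one conditioning vector.

    c_t : fused observation / floor-plan context embedding
    p_g : goal position on the floor plan (assumed 2D here)
    p_t : current agent position (assumed 2D here)
    r_t : current agent heading (assumed to be a single angle here)
    """
    return [*c_t, *p_g, *p_t, r_t]


# Hypothetical values; a real context embedding would be much larger.
context = [0.1, 0.2, 0.3]
goal = (4.0, 5.0)
position = (1.0, 2.0)
heading = 0.5

condition = build_policy_condition(context, goal, position, heading)
```

Under this reading, the two variants differ only in where the pose arguments come from: Naive-FloDiff would pass its own predicted pose $(\hat{p}_t, \hat{r}_t)$, while Loc-FloDiff would pass the ground truth pose or the output of a pre-trained localization model.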