NaviDiffusor: Cost-Guided Diffusion Model for Visual Navigation

Yiming Zeng, Hao Ren, Shuhang Wang, Junlong Huang, Hui Cheng

TL;DR

NaviDiffusor presents a hybrid approach that blends classical cost constraints with a conditional diffusion model trained on path-RGB pairs for visual navigation. During inference, differentiable task-level and scene-specific costs guide the diffusion sampling, enabling the generation of multimodal, constraint-satisfying paths without retraining. The method demonstrates strong zero-shot generalization across indoor/outdoor, simulated/real-world scenarios, outperforming baselines in collision avoidance and success rate, and it includes a path-selection mechanism to ensure temporal consistency. Practical deployment is supported by RGB-only sensing, monocular depth estimation for collision costs, and a plug-and-play inference pipeline that leverages diffusion priors. The work highlights a scalable route to integrate explicit geometric constraints within learning-based planning for robust robotic navigation.

Abstract

Visual navigation, a fundamental challenge in mobile robotics, demands versatile policies to handle diverse environments. Classical methods leverage geometric solutions to minimize specific costs, which makes them adaptable to new scenarios, but they are prone to system errors due to their multi-modular design and reliance on hand-crafted rules. Learning-based methods, while achieving high planning success rates, have difficulty generalizing to unseen environments beyond the training data and often require extensive training. To address these limitations, we propose a hybrid approach that combines the strengths of learning-based methods and classical approaches for RGB-only visual navigation. Our method first trains a conditional diffusion model on diverse path-RGB observation pairs. During inference, it integrates the gradients of differentiable scene-specific and task-level costs, guiding the diffusion model to generate valid paths that meet the constraints. This approach alleviates the need for retraining, offering a plug-and-play solution. Extensive experiments in both indoor and outdoor settings, across simulated and real-world scenarios, demonstrate the zero-shot transfer capability of our approach, which achieves higher success rates and fewer collisions than baseline methods. Code will be released at https://github.com/SYSU-RoboticsLab/NaviD.
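
To make the idea of cost-guided sampling concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation. The learned conditional diffusion model is replaced by a hypothetical stand-in (`toy_denoise`) that pulls a noisy 2D path toward a fixed straight-line prior, and the goal and collision costs are simple hand-written quadratics with analytic gradients. All function names, the circular obstacle, the noise schedule, and the guidance rule (subtracting `scale` times the cost gradient after each denoising step) are assumptions chosen for illustration.

```python
import math
import random

def goal_grad(path, goal):
    """Gradient of 0.5 * ||path[-1] - goal||^2; only the last waypoint is affected."""
    grad = [(0.0, 0.0)] * len(path)
    grad[-1] = (path[-1][0] - goal[0], path[-1][1] - goal[1])
    return grad

def collision_grad(path, center, radius):
    """Gradient of 0.5 * max(0, radius - dist)^2 for one circular obstacle."""
    grad = []
    for x, y in path:
        dx, dy = x - center[0], y - center[1]
        dist = math.hypot(dx, dy)
        if 0.0 < dist < radius:
            push = radius - dist  # penetration depth
            grad.append((-push * dx / dist, -push * dy / dist))
        else:
            grad.append((0.0, 0.0))
    return grad

def toy_denoise(path, prior, sigma, rng):
    """Hypothetical stand-in for a learned denoiser:
    move each waypoint halfway to the prior path, then re-inject noise."""
    return [(0.5 * (x + px) + rng.gauss(0.0, sigma),
             0.5 * (y + py) + rng.gauss(0.0, sigma))
            for (x, y), (px, py) in zip(path, prior)]

def sample_path(prior, goal, center, radius, scale, steps=50, seed=0):
    """Start from Gaussian noise; alternate a denoising step with a cost-gradient step."""
    rng = random.Random(seed)
    path = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in prior]
    for k in range(steps):
        sigma = 0.1 * (1.0 - (k + 1) / steps)  # assumed schedule, zero at the last step
        path = toy_denoise(path, prior, sigma, rng)
        g_goal = goal_grad(path, goal)
        g_col = collision_grad(path, center, radius)
        # Guidance: descend the summed cost gradient (scale = 0 disables guidance).
        path = [(x - scale * (a[0] + b[0]), y - scale * (a[1] + b[1]))
                for (x, y), a, b in zip(path, g_goal, g_col)]
    return path

def clearance(path, center):
    """Smallest distance from any waypoint to the obstacle center."""
    return min(math.hypot(x - center[0], y - center[1]) for x, y in path)

# Toy scene: the straight-line prior cuts through the obstacle and stops short of the goal.
PRIOR = [(0.5 * i, 0.0) for i in range(9)]
GOAL, CENTER, RADIUS = (4.0, 0.5), (2.0, 0.2), 0.8

unguided = sample_path(PRIOR, GOAL, CENTER, RADIUS, scale=0.0)
guided = sample_path(PRIOR, GOAL, CENTER, RADIUS, scale=0.8)
print("clearance unguided / guided:", clearance(unguided, CENTER), clearance(guided, CENTER))
```

In this toy setting the stand-in prior is never changed; only the sampling loop is modified. That mirrors the plug-and-play claim in the abstract: the costs act at inference time, so changing the obstacle or the goal requires no retraining. The `scale` parameter plays the role of a guidance scale, trading adherence to the prior against constraint satisfaction.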

Paper Structure

This paper contains 19 sections, 5 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: The robot needs to navigate to a destination (i.e., an image goal or a point goal) based on the given RGB observations. We incorporate collision and goal cost guidance to improve local path generation.
  • Figure 2: Pipeline overview: RGB observations and the image goal are processed through two encoders, $\Psi_\mathcal{O}$ and $\Psi_\mathcal{G}$, then fed to a transformer, whose output serves as the condition for the diffusion model. The gradient of the designed cost function, $\nabla \mathcal{F}$, is incorporated at each denoising step to guide the local path generation. For long-horizon navigation, a high-level policy, such as a topological map, is used to provide subgoals, supporting both image and point goals.
  • Figure 3: Example of an estimated depth map and the local TSDF cost map generated from an RGB observation in the Stanford 2D-3D-S environment.
  • Figure 4: Effect of different guidance scales: The guidance scale increases from left to right. For each scale, we sample 50 paths with guidance (red) and 50 paths without guidance (blue).
  • Figure 5: Qualitative Path Comparison between the proposed NaviDiffusor (Red) and baseline method NoMaD (Blue) in 2D-3D-S and Citysim Environments under Basic and Extra Obstacles Settings. Our method avoids extra obstacles that are not present in the topological map, while the baseline method fails.
  • ...and 1 more figures