MonoMPC: Monocular Vision Based Navigation with Learned Collision Model and Risk-Aware Model Predictive Control

Basant Sharma, Prajyot Jadhav, Pranjal Paul, K. Madhava Krishna, Arun Kumar Singh

TL;DR

This work tackles monocular navigation in clutter by moving beyond noisy depth-based collision checks to a depth-conditioned probabilistic collision model that predicts a distribution over obstacle clearance for a given trajectory. The model feeds a risk-aware Model Predictive Control (MPC) framework, with a novel risk metric based on Maximum Mean Discrepancy (MMD) that compares the predicted clearance distribution to a feasible boundary via a chance-constraint formulation. A task-aware training pipeline jointly optimizes the collision model and risk estimator using safe and unsafe trajectories to calibrate uncertainty, resulting in better-calibrated predictions and safer, faster navigation in real-world clutter. The approach demonstrates real-time performance, robust improvements over ROSNAV, MonoNav, and NoMaD, and strong potential for deployment on mobile platforms; future work includes temporal memory and dynamic environments.

Abstract

Navigating unknown environments with a single RGB camera is challenging, as the lack of depth information prevents reliable collision-checking. While some methods use estimated depth to build collision maps, we found that depth estimates from vision foundation models are too noisy for zero-shot navigation in cluttered environments. We propose an alternative approach: instead of using noisy estimated depth for direct collision-checking, we use it as a rich context input to a learned collision model. This model predicts the distribution of minimum obstacle clearance that the robot can expect for a given control sequence. At inference, these predictions inform a risk-aware MPC planner that minimizes estimated collision risk. We propose a joint learning pipeline that co-trains the collision model and risk metric using both safe and unsafe trajectories. Crucially, this joint training yields well-calibrated uncertainty in the collision model, which improves navigation in highly cluttered environments. In real-world experiments, the approach reduces the collision rate and improves goal reaching and speed over several strong baselines.
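
To make the planning idea in the abstract concrete, the sketch below picks the candidate control sequence with the lowest combined goal cost and collision risk. It is only an illustration under stated assumptions: the risk here is a plain Gaussian tail probability, P(clearance < margin), used as a stand-in for the paper's MMD-based metric, and the names `collision_probability`, `pick_control_sequence`, `margin`, and `risk_weight`, as well as all the numbers, are hypothetical. In the actual system the clearance mean and standard deviation would come from the learned collision model.

```python
import math


def collision_probability(mean, std, margin):
    """P(clearance < margin) when clearance ~ N(mean, std^2).

    Assumption: a Gaussian tail probability stands in for the
    paper's MMD-based risk metric.
    """
    z = (margin - mean) / (std * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def pick_control_sequence(candidates, margin=0.3, risk_weight=10.0):
    """Return (name, cost) of the candidate with the lowest total cost.

    candidates: list of (name, goal_cost, clearance_mean, clearance_std).
    The total cost is goal_cost + risk_weight * collision probability.
    """
    best_name, best_cost = None, math.inf
    for name, goal_cost, mean, std in candidates:
        risk = collision_probability(mean, std, margin)
        total = goal_cost + risk_weight * risk
        if total < best_cost:
            best_name, best_cost = name, total
    return best_name, best_cost


# Hypothetical predictions for three candidate control sequences.
candidates = [
    ("straight", 1.0, 0.35, 0.20),    # shortest path, small clearance
    ("veer_left", 1.4, 0.90, 0.10),   # longer, large and confident clearance
    ("veer_right", 1.6, 0.80, 0.40),  # longer, large but uncertain clearance
]

best_name, best_cost = pick_control_sequence(candidates)
print(best_name, round(best_cost, 3))
```

The predicted variance matters as much as the mean in this selection: a candidate with a large mean clearance but high uncertainty can still carry more risk than a slightly tighter but confident one, which is why calibrated uncertainty affects the planner's choices.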

Paper Structure

This paper contains 17 sections, 12 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Monocular navigation in cluttered environments using ROSNAV (top) vs. our approach (bottom). ROSNAV constructs cost maps directly from the estimated point cloud (green) generated by DepthAnything [NEURIPS2024_26cfdcd8], which deviates significantly from the ground truth (red), leading to incorrect free-space detection (e.g., top row, panel 3) and collisions. In contrast, our method treats the estimated point cloud as a conditioning input to a learned probabilistic collision model, integrated with a risk-aware MPC framework. Snapshots across time steps are shown for both methods (corresponding time indices are labeled).
  • Figure 2: Overview of the baseline learning pipeline for our probabilistic collision model, which predicts worst-case obstacle clearance along a trajectory. Given an RGB image and a control sequence, we extract geometric features from the estimated point cloud using a pre-trained depth estimator and PointNet++. Combined with the initial robot state, these form the observation vector, which an MLP uses to predict the mean and variance of obstacle clearance. The learnable components (yellow blocks) are trained end-to-end using a Gaussian negative log-likelihood loss.
  • Figure 3: Overview of our task-aware learning of the probabilistic collision model. The baseline model of Fig. 2 had weak supervision on the predicted variance due to the absence of ground-truth uncertainty, often resulting in over- or underconfident predictions. To address this, we introduce downstream supervision via collision risk estimation. The observation vector and control sequence are passed through $\text{MLP}_\theta$ to predict the mean, variance, and a kernel parameter. Using the reparameterization trick, we generate obstacle clearance samples to compute constraint violations, forming an MMD-based risk representation. This is processed by $\text{MLP}_\phi$, followed by a softmax layer. The learnable parts shown in yellow are trained end-to-end with a Gaussian NLL loss and a cross-entropy loss.
  • Figure 4: Box plot of minimum residuals between sampled worst-case obstacle clearances and ground truth. The augmented model achieves lower median error and reduced variability compared to the baseline, indicating sharper and more consistent alignment with ground-truth worst-case obstacle clearances.
  • Figure 5: Comparison between our approach (blue) and MonoNav [simon2023mononav] (red); goal in green. The noise in the estimated depth translates to erroneous 3D occupancy maps: the yellow cuboids (ground-truth obstacles) do not align with their reconstructed point clouds, resulting in MonoNav getting stuck or colliding (b-c). In contrast, our approach is able to avoid the yellow cuboids based on the noisy estimated depth. We do not use MonoNav’s 3D reconstruction; trajectories are overlaid solely for visualization and comparison.
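
The quantities named in the captions of Figures 2 and 3 (Gaussian NLL, reparameterized clearance samples, constraint violations, and an MMD-based risk) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: the violation is taken as max(0, margin - clearance), the kernel is an RBF whose `bandwidth` is a fixed number here (in the paper a kernel parameter is predicted by $\text{MLP}_\theta$), and the risk is the squared MMD between the violation samples and an all-zero reference representing "no violation". All function names and numeric values are hypothetical.

```python
import math
import random


def gaussian_nll(mean, var, target):
    """Negative log-likelihood of target under N(mean, var)."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (target - mean) ** 2 / var)


def sample_clearances(mean, std, n, rng):
    """Reparameterization trick: d = mean + std * eps, eps ~ N(0, 1)."""
    return [mean + std * rng.gauss(0.0, 1.0) for _ in range(n)]


def violations(clearances, margin):
    """Assumed constraint violation: how far each sample falls below margin."""
    return [max(0.0, margin - d) for d in clearances]


def rbf(a, b, bandwidth):
    """RBF kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_risk(viol, bandwidth):
    """Biased squared-MMD estimate between violation samples and zeros.

    With an all-zero reference set, k(0, 0) = 1, so the estimate
    reduces to mean k(x, x') - 2 * mean k(x, 0) + 1.
    """
    n = len(viol)
    k_xx = sum(rbf(a, b, bandwidth) for a in viol for b in viol) / (n * n)
    k_x0 = sum(rbf(a, 0.0, bandwidth) for a in viol) / n
    return k_xx - 2.0 * k_x0 + 1.0


rng = random.Random(0)
margin, bandwidth = 0.3, 0.1

# Hypothetical predictions: one comfortably safe, one tight trajectory.
safe = violations(sample_clearances(1.0, 0.05, 200, rng), margin)
risky = violations(sample_clearances(0.2, 0.10, 200, rng), margin)

safe_risk = mmd_risk(safe, bandwidth)
risky_risk = mmd_risk(risky, bandwidth)
print(safe_risk, risky_risk)
```

Because the samples are a deterministic function of the predicted mean and standard deviation plus independent noise, a risk computed this way stays differentiable with respect to both in an autodiff framework, which is what lets a downstream risk loss supervise the predicted variance.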