Deep Bayesian Future Fusion for Self-Supervised, High-Resolution, Off-Road Mapping

Shubhra Aich, Wenshan Wang, Parv Maheshwari, Matthew Sivaprakasam, Samuel Triest, Cherie Ho, Jason M. Gregory, John G. Rogers, Sebastian Scherer

TL;DR

This work tackles high-resolution off-road mapping under long-range sparsity and sensing noise by introducing Deep Bayesian Future Fusion (DBFF), a dense BEV map completion framework that operates at 2 cm resolution over a 30 m forward view. It combines a Bayes-filter-inspired fusion mechanism with a CNN/RNN backbone and perceptual generative losses to predict dense RGB and height maps from sparse measurements, using a proximal-distal latent split and cross-attention-based measurement updates to maintain geometric consistency. The authors devise a self-supervised training regime, called Future Fusion, that generates large-scale dense ground-truth BEV maps from stereo, RGB, and LiDAR data, and they demonstrate improvements over baselines in both direct map quality (MAE, FID, SSIM) and downstream costmap prediction. The approach runs in real time (~16 Hz), and the features learned from the completed maps carry meaningful terrain information, which suggests the framework could serve as a pretraining stage for dense mapping in autonomous off-road navigation.

Abstract

High-speed off-road navigation requires long-range, high-resolution maps to enable robots to safely navigate over different surfaces while avoiding dangerous obstacles. However, due to limited computational power and sensing noise, most approaches to off-road mapping focus on producing coarse (20-40 cm) maps of the environment. In this paper, we propose Future Fusion, a framework capable of generating dense, high-resolution maps from sparse sensing data (30 m forward at 2 cm). This is accomplished by (1) an efficient realization of the well-known Bayes filter within standard deep learning models that explicitly accounts for the sparsity pattern in stereo and LiDAR depth data, and (2) leveraging perceptual losses common in generative image completion. The proposed methodology outperforms the conventional baselines. Moreover, the learned features and the completed dense maps lead to improvements in the downstream navigation task.
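
For reference, the Bayes filter mentioned in the abstract alternates a prediction step and a measurement update. The recursion below is the standard textbook form, written with the symbols used in the Figure 5 caption (previous state $\mathbf{s}_{t-1}$, control action $\mathbf{a}_t$, measurement $\mathbf{z}_t$). The belief notation $bel(\cdot)$ and the normalizer $\eta$ are assumptions introduced here for illustration and are not taken from the paper's own equations.

$$\overline{bel}(\mathbf{s}_t) = \int p(\mathbf{s}_t \mid \mathbf{s}_{t-1}, \mathbf{a}_t)\, bel(\mathbf{s}_{t-1})\, d\mathbf{s}_{t-1} \qquad \text{(prediction step)}$$

$$bel(\mathbf{s}_t) = \eta\, p(\mathbf{z}_t \mid \mathbf{s}_t)\, \overline{bel}(\mathbf{s}_t) \qquad \text{(measurement update)}$$

In the deep variant described in the Figure 5 caption, the RNN roll-out of the proximal latent plays the role of the prediction step, and the cross-attention modulation by the distal latent plays the role of the measurement update.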

Paper Structure (13 sections, 5 equations, 8 figures, 2 tables)

This paper contains 13 sections, 5 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: RGB (top) and height (bottom) bird's-eye view (BEV) maps (cropped at $12m \times 30m$) with $2cm$ pixel resolution: the sparse input (left), the future-fusion label (right), and the prediction from our Bayesian UNet/e2 structure trained with these input/label pairs (middle column). Best viewed in digital format. The images are downsampled to comply with the size limit.
  • Figure 2: Sample BEV images (cropped) showing the same region at different pixel resolutions -- $2$cm, used in this work, and $20$cm, which is the finest in recent literature [roadrunner]. The difference in perceptual quality for a typical off-road terrain is evident.
  • Figure 3: Block diagram of the data generation protocol via future-fusion. The odometry estimate provided by the Super Odometry framework [super-odom] is used to register the colorized stereo point cloud [tartan-vo] and LiDAR scans to generate the trajectory RGB and height maps. These dense maps, containing billions of points, are then used to generate dense labels corresponding to the sparse local BEV maps for self-supervision. See Section \ref{subsec:data-gen} for details.
  • Figure 4: Comparison of different pixel attribution strategies for BEV map generation -- unweighted average (Avg); inverse-distance weighted average (IDW-Avg); closest point attribution (Closest). The distinctive perceptual quality of the latter strategy is evident. Best viewed in digital format.
  • Figure 5: Schematic of the deep Bayesian fusion mechanism. The input is split into two disjoint sub-regions: (1) a smaller, proximal, and reliable one resembling the combined previous state $\mathbf{s}_{t-1}$ and control action $\mathbf{a}_t$, and (2) a distal, sparse, and noisy one equivalent to the measurement $\mathbf{z}_t$. (Prediction step) The distal latent is predicted by unrolling the proximal latent via an RNN. (Measurement update) The predicted roll-out is modulated by the noisy distal latent via a cross-attention mechanism. Finally, the fused latent is decoded into the complete local BEV map representing the current state $\mathbf{s}_t$.
  • ...and 3 more figures
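
The three pixel attribution strategies named in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `rasterize`, the point tuple layout, and the sparse dictionary output are hypothetical. The caption does not say what the distance is measured against, so the sketch assumes each point simply carries its own distance value (for example, the sensor-to-point range at capture time).

```python
import math
from collections import defaultdict


def rasterize(points, resolution=0.02, strategy="closest"):
    """Bin points into BEV cells and attribute one value per cell.

    points: iterable of (x, y, value, dist) tuples, with x and y in metres.
            `dist` is an assumed per-point distance (hypothetical field).
    resolution: cell size in metres (0.02 corresponds to 2 cm).
    strategy: "avg", "idw", or "closest".
    Returns a dict mapping (col, row) cell indices to the attributed value.
    """
    cells = defaultdict(list)
    for x, y, value, dist in points:
        key = (math.floor(x / resolution), math.floor(y / resolution))
        cells[key].append((value, dist))

    bev = {}
    for key, samples in cells.items():
        if strategy == "avg":
            # Unweighted average of every point that falls in the cell.
            bev[key] = sum(v for v, _ in samples) / len(samples)
        elif strategy == "idw":
            # Inverse-distance weighted average; clamp to avoid division by zero.
            weights = [1.0 / max(d, 1e-6) for _, d in samples]
            total = sum(w * v for w, (v, _) in zip(weights, samples))
            bev[key] = total / sum(weights)
        elif strategy == "closest":
            # Keep only the value of the point with the smallest distance.
            bev[key] = min(samples, key=lambda s: s[1])[0]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return bev


# Two points share cell (0, 0); one point falls in cell (1, 0).
demo_points = [(0.5, 0.5, 10.0, 1.0), (0.2, 0.8, 20.0, 4.0), (1.5, 0.5, 7.0, 2.0)]
for name in ("avg", "idw", "closest"):
    print(name, rasterize(demo_points, resolution=1.0, strategy=name))
```

For the shared cell, averaging blends the two values (15.0), inverse-distance weighting leans toward the nearer point (12.0), and closest-point attribution keeps the nearer point's value unchanged (10.0). That absence of blending is one plausible reason the Closest maps look sharper. At the 2 cm resolution and $12m \times 30m$ crop of Figure 1, a dense grid of this kind would span 600 x 1500 cells.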