NF-SLAM: Effective, Normalizing Flow-supported Neural Field representations for object-level visual SLAM in automotive applications

Li Cui, Yang Ding, Richard Hartley, Zirui Xie, Laurent Kneip, Zhenghua Yu

TL;DR

NF-SLAM targets robust, vision-only object-level SLAM for automotive scenarios. It augments an implicit neural shape representation with a normalizing-flow prior, so that a compact $16$-D latent space can recover vehicle shapes from sparse measurements. The front-end adapts keypoint extraction density using semantic regions of interest, and the back-end jointly minimizes an SDF surface loss, a silhouette-consistency loss, and a rendering-depth loss across multiple frames. On synthetic ShapeNet data and real KITTI sequences, NF-SLAM reaches shape reconstruction quality and map completeness competitive with lidar-enabled baselines, and it remains robust to partial observations. Using stereo vision alone, the approach delivers object-level maps suited to practical ADAS applications and stays stable even when depth information is limited.

Abstract

We propose a novel, vision-only object-level SLAM framework for automotive applications representing 3D shapes by implicit signed distance functions. Our key innovation consists of augmenting the standard neural representation by a normalizing flow network. As a result, achieving strong representation power on the specific class of road vehicles is made possible by compact networks with only 16-dimensional latent codes. Furthermore, the newly proposed architecture exhibits a significant performance improvement in the presence of only sparse and noisy data, which is demonstrated through comparative experiments on synthetic data. The module is embedded into the back-end of a stereo-vision based framework for joint, incremental shape optimization. The loss function is given by a combination of a sparse 3D point-based SDF loss, a sparse rendering loss, and a semantic mask-based silhouette-consistency term. We furthermore leverage semantic information to determine keypoint extraction density in the front-end. Finally, experimental results on real-world data reveal accurate and reliable performance comparable to alternative frameworks that make use of direct depth readings. The proposed method performs well with only sparse 3D points obtained from bundle adjustment, and eventually continues to deliver stable results even under exclusive use of the mask-consistency term.
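
The three-term loss described in the abstract can be illustrated with a minimal sketch. The concrete form of each term (squared SDF residuals, squared depth residuals, binary cross-entropy against the mask), the function names, and the weights are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def sdf_surface_loss(sdf_values):
    # Assumption: sparse 3D points lie on the object surface, so their SDF should be zero.
    return float(np.mean(np.square(sdf_values)))

def render_depth_loss(rendered_depth, measured_depth):
    # Assumption: squared difference between rendered and measured depth at sparse pixels.
    return float(np.mean(np.square(rendered_depth - measured_depth)))

def silhouette_loss(occupancy, mask, eps=1e-7):
    # Assumption: binary cross-entropy between rendered occupancy and the semantic mask.
    p = np.clip(occupancy, eps, 1.0 - eps)
    return float(-np.mean(mask * np.log(p) + (1.0 - mask) * np.log(1.0 - p)))

def total_loss(sdf_values, rendered_depth, measured_depth, occupancy, mask,
               weights=(1.0, 1.0, 1.0)):
    # Hypothetical weights; setting the first two to zero leaves only the mask term.
    w_sdf, w_depth, w_mask = weights
    return (w_sdf * sdf_surface_loss(sdf_values)
            + w_depth * render_depth_loss(rendered_depth, measured_depth)
            + w_mask * silhouette_loss(occupancy, mask))
```

Passing `weights=(0.0, 0.0, 1.0)` corresponds to the mask-only setting mentioned at the end of the abstract, under the assumptions above.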

Paper Structure

This paper contains 22 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of our proposed framework for object-level visual SLAM in automotive applications.
  • Figure 2: Implicit neural shape representation used in the present work. The architecture is composed of a DeepSDF decoder preceded by a normalizing flow network. Given an input latent code $\mathbf{w}$ and a sampling point $\{x,y,z\}$, the network outputs the signed Euclidean distance between the sampled point and the object surface.
  • Figure 3: Shape optimization results for complete point clouds taken from ShapeNet. From left to right: original model, point cloud samples, our proposed generator with normalizing flow optimized with Gauss-Newton, the same generator optimized with Adam, DeepSDF results using Gauss-Newton, and DeepSDF results using Adam.
  • Figure 4: Shape optimization results for partial point clouds taken from ShapeNet. From left to right: original model, point cloud samples, our proposed generator with normalizing flow optimized with Gauss-Newton, the same generator optimized with Adam, DeepSDF results using Gauss-Newton, and DeepSDF results using Adam.
  • Figure 5: Qualitative results obtained on the KITTI dataset. In subfigure (b), DSP-SLAM is shown above and DSP-SLAM* below. In subfigure (c), NF-SLAM is shown above and NF-SLAM* below.
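
The architecture described in the Figure 2 caption, a decoder preceded by a normalizing flow, can be sketched as follows. This is a toy illustration only: the single affine coupling layer, the one-hidden-layer decoder, the random weights, and all names are assumptions, not the network used in the paper. Only the 16-dimensional latent code and the overall flow-then-decoder layout come from the text.

```python
import numpy as np

LATENT_DIM = 16  # latent code size stated in the abstract

def init_params(rng, hidden=32):
    # Random weights for one coupling layer and a toy decoder (all hypothetical).
    half = LATENT_DIM // 2
    return {
        "W_s": 0.1 * rng.standard_normal((half, half)),
        "W_t": 0.1 * rng.standard_normal((half, half)),
        "W1": 0.1 * rng.standard_normal((LATENT_DIM + 3, hidden)),
        "b1": np.zeros(hidden),
        "W2": 0.1 * rng.standard_normal((hidden, 1)),
        "b2": np.zeros(1),
    }

def coupling_forward(w, p):
    # Affine coupling: keep the first half, scale and shift the second half.
    half = LATENT_DIM // 2
    w1, w2 = w[:half], w[half:]
    s = np.tanh(w1 @ p["W_s"])   # log-scale, bounded for stability
    t = w1 @ p["W_t"]            # shift
    return np.concatenate([w1, w2 * np.exp(s) + t])

def coupling_inverse(z, p):
    # Exact inverse of coupling_forward, which is what makes the layer a flow.
    half = LATENT_DIM // 2
    z1, z2 = z[:half], z[half:]
    s = np.tanh(z1 @ p["W_s"])
    t = z1 @ p["W_t"]
    return np.concatenate([z1, (z2 - t) * np.exp(-s)])

def sdf(w, points, p):
    # Map latent code w through the flow, then decode one value per 3D point.
    z = coupling_forward(w, p)
    x = np.concatenate([np.tile(z, (points.shape[0], 1)), points], axis=1)
    h = np.maximum(x @ p["W1"] + p["b1"], 0.0)  # ReLU hidden layer
    return (h @ p["W2"] + p["b2"]).ravel()
```

A short usage example, with a random latent code and query points:

```python
rng = np.random.default_rng(0)
params = init_params(rng)
w = rng.standard_normal(LATENT_DIM)
query_points = rng.uniform(-1.0, 1.0, size=(5, 3))
distances = sdf(w, query_points, params)  # one value per query point
```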