Q-SLAM: Quadric Representations for Monocular SLAM

Chensheng Peng, Chenfeng Xu, Yue Wang, Mingyu Ding, Heng Yang, Masayoshi Tomizuka, Kurt Keutzer, Marco Pavone, Wei Zhan

TL;DR

This paper argues that rigid scene components can be effectively decomposed into quadric surfaces, and uses the quadric assumption to rectify noisy depth estimations from RGB inputs, which significantly improves depth estimation accuracy.

Abstract

In this paper, we reimagine volumetric representations through the lens of quadrics. We posit that rigid scene components can be effectively decomposed into quadric surfaces. Leveraging this assumption, we reshape volumetric representations, replacing millions of cubes with several quadric planes, which results in more accurate and efficient modeling of 3D scenes in SLAM contexts. First, we use the quadric assumption to rectify noisy depth estimations from RGB inputs. This step significantly improves depth estimation accuracy and allows us to efficiently sample ray points around quadric planes instead of across the entire volume, as previous NeRF-SLAM systems do. Second, we introduce a novel quadric-decomposed transformer to aggregate information across quadrics. The quadric semantics are not only used explicitly for depth correction and scene decomposition, but also serve as an implicit supervision signal for the mapping network. Through rigorous experimental evaluation, our method exhibits superior performance over other approaches relying on estimated depth, and achieves accuracy comparable to methods utilizing ground-truth depth on both synthetic and real-world datasets.
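To make the depth-rectification idea concrete, here is a minimal sketch, not the paper's actual correction module. It assumes a quadric is already known as a symmetric 4x4 matrix Q with surface x^T Q x = 0 in homogeneous coordinates, and that "depth" means distance along a unit-length ray. The function names `quadric_form` and `correct_depth` are hypothetical. The sketch snaps a noisy depth to the nearest intersection of the ray with the quadric.

```python
import math

def quadric_form(Q, p, q):
    """Bilinear form p^T Q q for homogeneous 4-vectors p, q and a 4x4 matrix Q."""
    return sum(p[i] * Q[i][j] * q[j] for i in range(4) for j in range(4))

def correct_depth(Q, origin, direction, noisy_depth, eps=1e-12):
    """Snap a noisy depth along a ray onto the quadric x^T Q x = 0.

    Returns the positive ray-quadric intersection depth closest to
    noisy_depth, or noisy_depth unchanged if the ray misses the surface.
    (Illustrative assumption: the quadric Q is given, not estimated here.)
    """
    o = list(origin) + [1.0]     # homogeneous point
    d = list(direction) + [0.0]  # homogeneous direction
    # Expand (o + t d)^T Q (o + t d) = a t^2 + b t + c
    a = quadric_form(Q, d, d)
    b = quadric_form(Q, o, d) + quadric_form(Q, d, o)
    c = quadric_form(Q, o, o)
    if abs(a) < eps:
        # Degenerate case: the equation is linear in t (e.g. a plane).
        if abs(b) < eps:
            return noisy_depth
        roots = [-c / b]
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return noisy_depth  # ray does not hit the quadric
        s = math.sqrt(disc)
        roots = [(-b - s) / (2.0 * a), (-b + s) / (2.0 * a)]
    valid = [t for t in roots if t > 0.0]
    if not valid:
        return noisy_depth
    return min(valid, key=lambda t: abs(t - noisy_depth))

# Unit sphere centered at (0, 0, 5): x^2 + y^2 + z^2 - 10 z + 24 = 0
SPHERE = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -5.0],
    [0.0, 0.0, -5.0, 24.0],
]
corrected = correct_depth(SPHERE, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 4.3)
```

In this usage example the ray along +z meets the sphere at depths 4 and 6, so a noisy estimate of 4.3 is pulled to the nearer intersection. The same intersection depth also gives a natural center for sampling ray points near the surface rather than over the whole volume.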

Paper Structure (29 sections, 15 equations, 7 figures, 7 tables)


Figures (7)

  • Figure 1: Overview of our proposed method. From the input RGB sequences, we predict the depth map, camera pose, and segmentation mask. The initially estimated depth is then corrected based on the quadric assumption. Along with the segmentation mask, camera poses, and image frames, the corrected depth is used to optimize the NeRF network. During the 3D reconstruction process, our proposed quadric ray transformer leverages the quadric priors effectively.
  • Figure 2: Quadric depth correction
  • Figure 3: The detailed structure of quadric ray transformer.
  • Figure 4: Qualitative reconstruction results on the Replica dataset. We compare our solution with the recent SOTA SLAM systems Co-SLAM [wang2023co] and GO-SLAM [zhang2023go]. Our method recovers better texture features, especially at instance boundaries.
  • Figure 5: Structure overview of Q-SLAM. 1) Tracking: initialize per-frame camera poses and depth predictions, then correct the noisy depth using our proposed depth correction module based on the segmentation results from monocular inputs. 2) NeRF: use the selected keyframes to supervise the optimization of the NeRF network equipped with our proposed quadric-decomposed transformer. 3) Mapping: run global bundle adjustment to jointly optimize the scene representation and camera poses, using rays sampled from all keyframes. The complete scene is reconstructed by fusing the rendered RGB images and depth maps with TSDF-fusion [zeng20163dmatch].
  • ...and 2 more figures