
VOOM: Robust Visual Object Odometry and Mapping using Hierarchical Landmarks

Yutong Wang, Chaoyang Jiang, Xieyuanli Chen

TL;DR

VOOM tackles robust visual SLAM by combining high-level object landmarks with traditional feature points in a coarse-to-fine, hierarchical framework. Objects are represented as dual quadrics and matched to image observations using a Wasserstein-based observation and data-association scheme, enabling robust pose estimation and map refinement with object cues guiding point associations. The system maintains an object covisibility graph to drive local bundle adjustment (BA), updating both object parameters and map points, and demonstrates improved localization over ORB-SLAM2 and other object-based SLAM methods on the TUM RGB-D and LM-Data datasets, including dynamic scenes. By leveraging semantic object information without introducing object residuals into BA, VOOM offers a principled pathway for semantically enriched SLAM and lays the groundwork for future loop closure and relocalization strategies that integrate both objects and points.

Abstract

In recent years, object-oriented simultaneous localization and mapping (SLAM) has attracted increasing attention due to its ability to provide high-level semantic information while maintaining computational efficiency. Some researchers have attempted to enhance localization accuracy by integrating modeled object residuals into bundle adjustment. However, few have demonstrated better results than feature-based visual SLAM systems, because generic coarse object models, such as cuboids or ellipsoids, are less accurate than feature points. In this paper, we propose VOOM, a Visual Object Odometry and Mapping framework that uses high-level objects and low-level points as hierarchical landmarks in a coarse-to-fine manner, instead of directly using object residuals in bundle adjustment. First, we introduce an improved observation model and a novel data association method for dual quadrics, which are employed to represent physical objects. This facilitates the creation of a 3D map that closely reflects reality. Next, we use object information to enhance the data association of feature points and consequently update the map. In the visual object odometry backend, the updated map is employed to further optimize the camera pose and the objects. Meanwhile, local bundle adjustment is performed using the object-based and point-based covisibility graphs in our visual object mapping process. Experiments show that VOOM outperforms both object-oriented SLAM and feature-point SLAM systems such as ORB-SLAM2 in terms of localization. The implementation of our method is available at https://github.com/yutongwangBIT/VOOM.git.
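To make the dual-quadric representation concrete, the following is a minimal sketch rather than the authors' implementation. It uses the standard projective-geometry result that an ellipsoid's dual quadric Q* projects to a dual conic C* = P Q* Pᵀ, and it compares two image ellipses with the closed-form 2-Wasserstein distance between Gaussians. Treating an ellipse as a Gaussian whose mean is the ellipse center and whose covariance is the ellipse shape matrix is an assumption made here for illustration; the exact observation model VOOM uses is not spelled out in this summary. All function names are hypothetical.

```python
import numpy as np

def ellipsoid_dual_quadric(axes, R, t):
    """Dual quadric Q* (4x4) of an ellipsoid with semi-axes `axes`,
    rotation R (3x3) and center t (3,), all in world coordinates."""
    Z = np.eye(4)
    Z[:3, :3] = R
    Z[:3, 3] = t
    D = np.diag([axes[0] ** 2, axes[1] ** 2, axes[2] ** 2, -1.0])
    return Z @ D @ Z.T

def project_to_ellipse(Q, P):
    """Project dual quadric Q with a 3x4 camera matrix P.
    Returns the ellipse center c and shape matrix S such that the outline is
    (x - c)^T S^{-1} (x - c) = 1. Assumes the camera is outside the
    ellipsoid and the object is in front of it, so C*[2, 2] < 0."""
    C = P @ Q @ P.T
    C = C / -C[2, 2]                      # normalize so that C[2, 2] == -1
    center = -C[:2, 2]
    shape = C[:2, :2] + np.outer(center, center)
    return center, shape

def sqrtm_psd(M):
    """Square root of a symmetric positive semi-definite matrix."""
    M = 0.5 * (M + M.T)                   # remove numerical asymmetry
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def wasserstein2(m1, S1, m2, S2):
    """2-Wasserstein distance between Gaussians N(m1, S1) and N(m2, S2)."""
    r = sqrtm_psd(S2)
    cross = sqrtm_psd(r @ S1 @ r)
    sq = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sqrt(max(sq, 0.0)))  # clamp tiny negative round-off
```

In use, a map object would be projected with the current camera matrix via `project_to_ellipse`, and the resulting ellipse compared with `wasserstein2` against an ellipse fitted to a detection. Unlike an overlap score such as IoU, this distance stays informative when the two ellipses do not overlap at all.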

Paper Structure

This paper contains 16 sections, 5 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Top: Ground-truth map. Middle: ORB-SLAM2 with loop closure disabled. Bottom: VOOM. The colored ellipsoids represent the 3D objects, blue cones illustrate the camera poses, and green lines represent covisibility relationships between keyframes. The map generated by ORB-SLAM2 has redundant point clouds and blurry boundaries, while our VOOM builds a more lightweight yet semantically enhanced map, resulting in more accurate camera pose estimation.
  • Figure 2: The proposed VOOM framework uses both low-level feature-based map points and high-level objects as landmarks. Yellow indicates a process in which only feature points participate, mint indicates a purely object-based process, and blue indicates that both kinds of landmarks contribute.
  • Figure 3: Illustration of associating map points based on the associated objects.
  • Figure 4: Illustration of covisibility graphs.
  • Figure 5: The qualitative mapping results on the Diamond Ground Sequence. The obviously distorted parts of the ORB-SLAM2 map are circled in red.
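
The covisibility graphs mentioned in the abstract and in Figure 4 can be illustrated with a small sketch. It assumes the common construction in which two keyframes are linked when they observe at least `min_shared` landmarks in common, with the edge weight equal to the number of shared landmarks. The same routine applies whether the landmark IDs refer to objects or to map points. The function names and the threshold are hypothetical and are not taken from the VOOM code.

```python
from collections import defaultdict
from itertools import combinations

def build_covisibility(observations, min_shared=1):
    """Build a covisibility graph.

    observations: dict mapping keyframe id -> set of landmark ids it observes.
    Returns a dict mapping (kf_a, kf_b) with kf_a < kf_b to the number of
    landmarks both keyframes observe, keeping only edges with at least
    `min_shared` shared landmarks."""
    # Invert the mapping: landmark id -> keyframes that observe it.
    seen_by = defaultdict(set)
    for kf, landmarks in observations.items():
        for lm in landmarks:
            seen_by[lm].add(kf)

    # Every pair of keyframes that sees the same landmark gains one vote.
    weights = defaultdict(int)
    for kfs in seen_by.values():
        for a, b in combinations(sorted(kfs), 2):
            weights[(a, b)] += 1

    return {edge: w for edge, w in weights.items() if w >= min_shared}

def covisible_keyframes(graph, kf):
    """Keyframes connected to `kf`, strongest connection first."""
    linked = []
    for (a, b), w in graph.items():
        if a == kf:
            linked.append((b, w))
        elif b == kf:
            linked.append((a, w))
    return [k for k, _ in sorted(linked, key=lambda item: (-item[1], item[0]))]
```

A local bundle adjustment window could then be chosen by calling `covisible_keyframes` on the current keyframe and taking the first few results. Because a frame typically observes far fewer objects than points, an object-based graph would presumably need a much lower `min_shared` than a point-based one.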