
OCC-VO: Dense Mapping via 3D Occupancy-Based Visual Odometry for Autonomous Driving

Heng Li, Yifan Duan, Xinran Zhang, Haiyi Liu, Jianmin Ji, Yanyong Zhang

TL;DR

OCC-VO is a visual odometry framework that uses recent advances in deep learning to transform 2D camera images into 3D semantic occupancy, removing the traditional need to estimate ego poses and landmark locations concurrently.

Abstract

Visual Odometry (VO) plays a pivotal role in autonomous systems, with a principal challenge being the lack of depth information in camera images. This paper introduces OCC-VO, a novel framework that capitalizes on recent advances in deep learning to transform 2D camera images into 3D semantic occupancy, thereby circumventing the traditional need for concurrent estimation of ego poses and landmark locations. Within this framework, we use TPV-Former to convert surround-view camera images into 3D semantic occupancy. To address the challenges presented by this transformation, we have tailored a pose estimation and mapping algorithm that incorporates a Semantic Label Filter and a Dynamic Object Filter, and uses a Voxel PFilter to maintain a consistent global semantic map. Evaluations on Occ3D-nuScenes show a 20.6% improvement in Success Ratio and a 29.6% improvement in trajectory accuracy over ORB-SLAM3, and also demonstrate our ability to construct a comprehensive map. Our implementation is open-sourced and available at: https://github.com/USTCLH/OCC-VO.

Paper Structure (19 sections, 5 equations, 3 figures, 5 tables)


Figures (3)

  • Figure 1: Our approach transforms surround-view camera image sequences into trajectories and global semantic maps. Such a transformation can enhance scene understanding for downstream tasks in challenging environments.
  • Figure 2: Pipeline of our proposed OCC-VO.
  • Figure 3: The three filters we propose. The rectangular boxes represent the 3D semantic occupancy and the global semantic map, respectively, with each circle representing a specific point. The color of a circle represents its semantic label, and a dashed circle border indicates that the point is transient, with a low p-Index as defined in the Voxel PFilter section. (a) shows three poor point-pair matches; (b) removes the match with different labels; (c) eliminates the one from dynamic objects; and (d) filters out the one containing a low p-Index value.
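
The filtering sequence in the Figure 3 caption can be sketched as three successive checks on candidate point-pair matches. This is a minimal illustration, not the authors' implementation: the function name `filter_matches`, the label ids in `DYNAMIC_LABELS`, the threshold `p_min`, and the treatment of the p-Index as a single score per map point are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical label ids for dynamic classes (e.g. vehicles, pedestrians).
# The real ids depend on the dataset's label map.
DYNAMIC_LABELS = frozenset({1, 2, 3})


def filter_matches(
    frame_labels: Sequence[int],
    map_labels: Sequence[int],
    map_p_index: Sequence[float],
    pairs: Sequence[Tuple[int, int]],
    p_min: float = 0.5,  # assumed threshold, chosen only for illustration
) -> List[Tuple[int, int]]:
    """Keep the (frame point, map point) matches that pass all three checks."""
    kept = []
    for i, j in pairs:
        # (b) Semantic label check: both points must carry the same label.
        if frame_labels[i] != map_labels[j]:
            continue
        # (c) Dynamic object check: drop matches on dynamic classes.
        if frame_labels[i] in DYNAMIC_LABELS:
            continue
        # (d) p-Index check: drop matches to transient map points.
        if map_p_index[j] < p_min:
            continue
        kept.append((i, j))
    return kept


if __name__ == "__main__":
    frame_labels = [0, 1, 4, 4]
    map_labels = [0, 1, 5, 4]
    map_p_index = [0.9, 0.9, 0.9, 0.1]
    pairs = [(0, 0), (1, 1), (2, 2), (3, 3)]
    print(filter_matches(frame_labels, map_labels, map_p_index, pairs))
```

In this toy input, one match is on a dynamic label, one has mismatched labels, and one points at a map point with a low p-Index, so only the first match survives. The surviving matches would then be passed to the pose estimation step.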