ODTFormer: Efficient Obstacle Detection and Tracking with Stereo Cameras Based on Transformer

Tianye Ding, Hongyu Li, Huaizu Jiang

TL;DR

ODTFormer, a Transformer-based model that addresses both obstacle detection and tracking problems, achieves state-of-the-art performance in the obstacle detection task and reports comparable accuracy to state-of-the-art obstacle tracking models while requiring a fraction of their computation cost.

Abstract

Obstacle detection and tracking represent a critical component in robot autonomous navigation. In this paper, we propose ODTFormer, a Transformer-based model to address both obstacle detection and tracking problems. For the detection task, our approach leverages deformable attention to construct a 3D cost volume, which is decoded progressively in the form of voxel occupancy grids. We further track the obstacles by matching the voxels between consecutive frames. The entire model can be optimized in an end-to-end manner. Through extensive experiments on DrivingStereo and KITTI benchmarks, our model achieves state-of-the-art performance in the obstacle detection task. We also report comparable accuracy to state-of-the-art obstacle tracking models while requiring only a fraction of their computation cost, typically ten-fold to twenty-fold less. The code and model weights will be publicly released.
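To make the geometry behind a 3D cost volume concrete, the following is a minimal sketch of how voxel centers in a region of interest could be projected into a rectified stereo pair, so that each voxel has a location in both images from which features can be compared. It is not the authors' implementation: the abstract states that the cost volume is built with deformable attention, which this sketch does not model. The pinhole camera model, the left-camera coordinate frame, and the names `voxel_centers` and `project_stereo` are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Pixel = Tuple[float, float]


def voxel_centers(x_range: Tuple[float, float],
                  y_range: Tuple[float, float],
                  z_range: Tuple[float, float],
                  voxel_size: float) -> List[Point]:
    """Enumerate the centers of a regular voxel grid covering the ROI.

    Hypothetical helper: ranges are (low, high) in metres, expressed in the
    left-camera frame (x right, y down, z forward).
    """
    counts = [round((hi - lo) / voxel_size)
              for lo, hi in (x_range, y_range, z_range)]
    centers = []
    for k in range(counts[2]):
        for j in range(counts[1]):
            for i in range(counts[0]):
                centers.append((x_range[0] + (i + 0.5) * voxel_size,
                                y_range[0] + (j + 0.5) * voxel_size,
                                z_range[0] + (k + 0.5) * voxel_size))
    return centers


def project_stereo(point: Point, focal: float, cx: float, cy: float,
                   baseline: float) -> Tuple[Pixel, Pixel]:
    """Project a 3D point into the left and right images.

    Assumes a rectified pair with identical pinhole intrinsics and the right
    camera shifted by `baseline` along +x, so both projections share a row.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the cameras")
    v = focal * y / z + cy
    u_left = focal * x / z + cx
    u_right = focal * (x - baseline) / z + cx
    return (u_left, v), (u_right, v)


if __name__ == "__main__":
    grid = voxel_centers((-1.0, 1.0), (-1.0, 1.0), (4.0, 6.0), 1.0)
    left, right = project_stereo(grid[0], focal=1000.0, cx=640.0, cy=360.0,
                                 baseline=0.5)
    # The horizontal offset between the two projections is the disparity,
    # which equals focal * baseline / z for a rectified pair.
    print(left, right, left[0] - right[0])
```

Working directly on a voxel grid in this way means the sampling pattern is defined in metric 3D space rather than in per-dataset disparity units, which is consistent with the motivation given in the Figure 2 caption.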

Paper Structure

This paper contains 17 sections, 9 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: We propose ODTFormer for joint obstacle detection and tracking using stereo cameras. We first detect obstacles in the form of occupancy grids at each time step and then match them across two consecutive frames for tracking. In the example shown, our model detects all obstacles and tracks them accurately. The obstacle detection results are shown as red cubes, and the tracking results are marked as green arrows. Longer arrows indicate larger motion magnitude.
  • Figure 2: Illustration of the overall architecture design. Left: For obstacle detection, we first extract multi-scale 2D feature maps (Tan and Le, 2019; Lin et al., 2017) from each of the stereo images. We then encode the voxels in the region of interest (ROI) so that they cross-attend to the image features, and compute the matching cost through our novel cost volume construction method. The cost volume is constructed directly in 3D space, which conforms better to the scene geometry, disentangles dataset specifics from model design, and thus generalizes well. It is then progressively decoded into occupancy voxel grids. Right: We cast obstacle tracking as a matching problem, finding the correspondences of voxels across two consecutive frames and incorporating physical constraints to improve both accuracy and efficiency. The detection and tracking modules can be jointly optimized in an end-to-end manner and run efficiently.
  • Figure 3: Visual results of obstacle detection on the DrivingStereo dataset.
  • Figure 4: Visual results of obstacle tracking on the KITTI 2015 dataset. The first row shows the stacked images from consecutive frames. The obstacle detection results are shown as red cubes, and the tracking results are marked as green arrows. Longer arrows indicate larger motion magnitude.
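
The tracking-as-matching idea described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative stand-in, not the model's learned matching: it assumes each occupied voxel carries a feature vector, scores candidates by dot product, and interprets "physical constraints" as a bound on how many voxels an obstacle can move between two frames. That interpretation, and the names `match_voxels`, `motion_vectors`, and `max_disp`, are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Voxel = Tuple[int, int, int]


def match_voxels(prev: Dict[Voxel, List[float]],
                 curr: Dict[Voxel, List[float]],
                 max_disp: int) -> Dict[Voxel, Optional[Voxel]]:
    """Match each occupied voxel in the previous frame to the current frame.

    Candidates are restricted to voxels within `max_disp` cells along every
    axis (an assumed physical constraint), which also shrinks the search
    space. The best candidate is the one with the highest feature dot product.
    """
    matches: Dict[Voxel, Optional[Voxel]] = {}
    for p_idx, p_feat in prev.items():
        best, best_score = None, float("-inf")
        for c_idx, c_feat in curr.items():
            # Skip candidates that would imply an implausibly large motion.
            if max(abs(a - b) for a, b in zip(p_idx, c_idx)) > max_disp:
                continue
            score = sum(a * b for a, b in zip(p_feat, c_feat))
            if score > best_score:
                best, best_score = c_idx, score
        matches[p_idx] = best  # None if no candidate lies within range
    return matches


def motion_vectors(matches: Dict[Voxel, Optional[Voxel]],
                   voxel_size: float) -> Dict[Voxel, Tuple[float, ...]]:
    """Convert index-space matches into metric displacement vectors."""
    return {
        p: tuple((c[i] - p[i]) * voxel_size for i in range(3))
        for p, c in matches.items() if c is not None
    }


if __name__ == "__main__":
    prev = {(0, 0, 0): [1.0, 0.0], (5, 5, 5): [0.0, 1.0]}
    curr = {(1, 0, 0): [0.9, 0.1], (9, 9, 9): [0.0, 1.0]}
    matches = match_voxels(prev, curr, max_disp=1)
    print(matches)
    print(motion_vectors(matches, voxel_size=0.5))
```

The displacement vectors returned by `motion_vectors` correspond to what the figures draw as green arrows: one vector per tracked voxel, with length proportional to motion magnitude.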