Transformer-based stereo-aware 3D object detection from binocular images

Hanqing Sun, Yanwei Pang, Jiale Cao, Jin Xie, Xuelong Li

TL;DR

This paper presents TS3D, a Transformer-based Stereo-aware 3D object detector, and proposes a Stereo Preserving Feature Pyramid Network (SPFPN), designed to preserve the correspondence information while fusing intra-scale and aggregating cross-scale stereo features.

Abstract

Transformers have shown promising progress in various visual object detection tasks, including monocular 2D/3D detection and surround-view 3D detection. More importantly, the attention mechanism in the Transformer model and the 3D information extraction in binocular stereo are both similarity-based. However, directly applying existing Transformer-based detectors to binocular stereo 3D object detection leads to slow convergence and significant precision drops. We argue that a key cause of this defect is that existing Transformers ignore the image correspondence information specific to binocular stereo. In this paper, we explore the model design of Transformers for binocular 3D object detection, focusing particularly on extracting and encoding task-specific image correspondence information. To this end, we present TS3D, a Transformer-based Stereo-aware 3D object detector. In TS3D, a Disparity-Aware Positional Encoding (DAPE) module is proposed to embed the image correspondence information into stereo features. The correspondence is encoded as normalized sub-pixel-level disparity and is used in conjunction with sinusoidal 2D positional encoding to provide the 3D location information of the scene. To enrich multi-scale stereo features, we propose a Stereo Preserving Feature Pyramid Network (SPFPN). The SPFPN is designed to preserve the correspondence information while fusing intra-scale and aggregating cross-scale stereo features. Our proposed TS3D achieves a 41.29% Moderate Car detection average precision on the KITTI test set and takes 88 ms to detect objects from each binocular image pair. It is competitive with advanced counterparts in terms of both precision and inference speed.
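To make the DAPE idea concrete, the following is a minimal sketch of how a normalized disparity could be combined with a sinusoidal 2D positional encoding. It is not the paper's implementation: the function names `sinusoidal` and `disparity_aware_encoding`, the concatenation of the three codes, the temperature of 10000, and the per-axis dimension are all assumptions chosen for illustration.

```python
import math

def sinusoidal(value, dim, temperature=10000.0):
    """Encode one scalar as `dim` interleaved sin/cos features (dim must be even)."""
    feats = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        feats.append(math.sin(value / freq))
        feats.append(math.cos(value / freq))
    return feats

def disparity_aware_encoding(x, y, disparity, width, height, max_disparity, dim=8):
    """Hypothetical encoding: sinusoidal codes of normalized x, y, and disparity.

    Assumption: the three codes are simply concatenated; the paper's DAPE
    module may combine them differently.
    """
    scale = 2 * math.pi
    x_norm = x / width                 # horizontal position in [0, 1]
    y_norm = y / height                # vertical position in [0, 1]
    d_norm = disparity / max_disparity # sub-pixel disparity in [0, 1]
    return (sinusoidal(x_norm * scale, dim)
            + sinusoidal(y_norm * scale, dim)
            + sinusoidal(d_norm * scale, dim))

enc = disparity_aware_encoding(x=320, y=180, disparity=12.4,
                               width=1280, height=384, max_disparity=192)
print(len(enc))  # 24 features: 8 each for x, y, and disparity
```

Because the disparity enters as a continuous value rather than an integer bin, two pixels whose disparities differ by a fraction of a pixel still receive distinct encodings in this sketch.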

Paper Structure (18 sections, 13 equations, 5 figures, 5 tables)

This paper contains 18 sections, 13 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: We adapt the Transformer-based surround-view 3D object detectors DETR3D (Wang et al., 2021), PETR (Liu et al., 2022), and BEVFormer (Li et al., 2022) to binocular detection. All of these detectors are trained on the KITTI training subset for 320 epochs. The Moderate validation AP of DETR3D-Binocular during training is plotted in (a), and the 3D detection APs are listed in (b). Existing Transformer detectors converge to a poor 3D detection AP, whereas the Transformer-based TS3D can be trained to superior performance.
  • Figure 2: The overall architecture of the Transformer-based Stereo-aware 3D object detector (TS3D). In sequence, TS3D takes binocular images as inputs (blue boxes in the figure denote the left view, green boxes the right view), extracts unary features, extracts stereo features using SPFPN (Stereo Preserving Feature Pyramid Network), estimates disparities, decodes object features using a multi-scale deformable DETR decoder (Zhu et al., 2021), and regresses and classifies 3D objects. The DAPE (Disparity-Aware Positional Encoding), detailed on the right, is used to explicitly encode image correspondence information for detection.
  • Figure 3: Comparison of (a) FPN (Lin et al., 2017) and (b) BiFPN (Tan et al., 2020) with the proposed (c) Stereo Preserving Feature Pyramid Network (SPFPN). FPN consists of a top-down path, and BiFPN introduces an additional bottom-up path. Our SPFPN uses the FPN to extract multi-scale unary features, and a six-level, three-scale cost volume pyramid is constructed from the unary features. Intra-Scale Fusion is performed where the disparity dimensions share the same definition, so the stereo features are summed; Cross-Scale Aggregation is performed where the disparity dimensions have different definitions, so the stereo features are expanded and concatenated with the lower-resolution feature. The image correspondence information is therefore preserved.
  • Figure 4: The characteristics of the proposed DAPE (Disparity-Aware Positional Encoding). (a) Given an input left image, we sample a foreground pixel (yellow circle) and a background pixel (blue circle). The disparity of the foreground pixel is approximately $12$, so the DAPE heatmap at $d = 12$ is visualized in (b). For the foreground and background pixels, the dot product between each pixel's positional encoding and the encodings at all pixels is computed, and the resulting heatmaps are visualized in (c) and (d), respectively. We then mask the input left image with those heatmaps in (e) and (f), respectively, demonstrating that DAPE focuses on the areas with similar disparity distributions.
  • Figure 5: Visualization of three detection results of our TS3D on the KITTI validation set, one column each. From top to bottom of each sample: inverse-projection of disparity estimation and 3D detection (pink), left image with projected 3D detection (pink), and left image with 3D ground-truth boxes (green).
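The inverse projection of a disparity estimate mentioned in Figure 5 follows standard rectified-stereo geometry, where depth is Z = f · B / d. The sketch below illustrates that relation only; the function name `disparity_to_point` and the camera values (focal length, baseline, principal point) are illustrative assumptions, not parameters taken from the paper.

```python
def disparity_to_point(u, v, disparity, focal_px, baseline_m, cx, cy):
    """Back-project a left-image pixel (u, v) with a given disparity to 3D.

    Assumes a rectified stereo pair with focal length `focal_px` (pixels),
    baseline `baseline_m` (meters), and principal point (cx, cy).
    Returns (X, Y, Z) in the left camera frame, in meters.
    """
    if disparity <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    z = focal_px * baseline_m / disparity  # depth from stereo triangulation
    x = (u - cx) * z / focal_px            # lateral offset from the optical axis
    y = (v - cy) * z / focal_px            # vertical offset from the optical axis
    return x, y, z

# Illustrative camera values (assumed, not from the paper).
point = disparity_to_point(u=600.0, v=170.0, disparity=12.0,
                           focal_px=720.0, baseline_m=0.54, cx=600.0, cy=170.0)
print(point)
```

With these assumed values, a disparity of 12 pixels corresponds to a depth of roughly 32.4 m, and depth grows as disparity shrinks, which is why sub-pixel disparity precision matters most for distant objects.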