S$^3$-MonoDETR: Supervised Shape&Scale-perceptive Deformable Transformer for Monocular 3D Object Detection

Xuan He; Jin Yuan; Kailun Yang; Zhenchao Zeng; Zhiyong Li

TL;DR

This paper proposes a novel ``Supervised Shape&Scale-perceptive Deformable Attention'' (S$^3$-DA) module for monocular 3D object detection. The module significantly improves detection accuracy and achieves state-of-the-art performance on both single-category and multi-category 3D object detection within a single training process.

Abstract

Recently, transformer-based methods have shown exceptional performance in monocular 3D object detection, the task of predicting 3D attributes from a single 2D image. These methods typically use visual and depth representations to generate query points on objects, and the quality of these query points plays a decisive role in detection accuracy. However, the unsupervised attention mechanisms in current transformers have no awareness of geometric appearance, so they tend to produce noisy features for query points. This severely limits network performance and leaves the model poorly able to detect multi-category objects in a single training process. To tackle this problem, this paper proposes a novel ``Supervised Shape&Scale-perceptive Deformable Attention'' (S$^3$-DA) module for monocular 3D object detection. Concretely, S$^3$-DA uses visual and depth features to generate diverse local features with various shapes and scales and, at the same time, predicts the corresponding matching distribution, which gives each query a useful shape&scale perception. As a result, S$^3$-DA effectively estimates receptive fields for query points belonging to any category, enabling them to generate robust query features. In addition, we propose a Multi-classification-based Shape&Scale Matching (MSM) loss to supervise this process. Extensive experiments on the KITTI and Waymo Open datasets demonstrate that S$^3$-DA significantly improves detection accuracy, yielding state-of-the-art performance on single-category and multi-category 3D object detection in a single training process compared to existing approaches. The source code will be made publicly available at https://github.com/mikasa3lili/S3-MonoDETR.
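The idea described in the abstract can be illustrated with a minimal, single-channel sketch. It is not the authors' implementation: the candidate box sizes in `CANDIDATES`, the helper names, and the use of box averaging are all assumptions made for illustration, and the matching logits are passed in directly rather than predicted from visual and depth features as in S$^3$-DA. The loss shown is a plain cross-entropy over the shape&scale candidates, used here only as a stand-in for the MSM loss.

```python
import math

# Hypothetical shape/scale candidates as (half_height, half_width) in pixels.
CANDIDATES = [(1, 1), (2, 1), (1, 2), (3, 3)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def box_mean(feature_map, cx, cy, half_h, half_w):
    """Average the feature values in a box centred on (cx, cy), clipped to the map."""
    height, width = len(feature_map), len(feature_map[0])
    y0, y1 = max(0, cy - half_h), min(height - 1, cy + half_h)
    x0, x1 = max(0, cx - half_w), min(width - 1, cx + half_w)
    vals = [feature_map[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
    return sum(vals) / len(vals)

def shape_scale_attention(feature_map, cx, cy, logits):
    """Update one query feature from local features of several shapes and scales.

    `logits` has one score per candidate; in this sketch it is an input,
    whereas the paper predicts the matching distribution from features.
    """
    local = [box_mean(feature_map, cx, cy, hh, hw) for hh, hw in CANDIDATES]
    probs = softmax(logits)  # matching distribution over candidates
    updated = sum(p * f for p, f in zip(probs, local))
    return updated, probs

def matching_loss(probs, target_index):
    """Cross-entropy against the index of the best-matching candidate (assumed label)."""
    return -math.log(probs[target_index])
```

With equal logits, every candidate receives the same weight, so the updated query feature is simply the mean of the candidate features; supervision through the loss is what pushes the distribution toward the candidate that matches the object's shape and scale.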

Paper Structure (16 sections, 16 equations, 5 figures, 8 tables)

Introduction
Related Work
Methodology
Architecture Overview
Supervised Shape$\&$Scale-perceptive Deformable Attention Module
Query-specific Diverse Local Feature Extraction
Visual$\&$Depth-guided Matching Distribution Prediction
Object Query Updating
Training Loss
Experiments
Experimental Setup
Performance Comparison
Evaluation of Multi-category Joint Training
Evaluation of S$^3$-DA
Qualitative Results
...and 1 more sections

Figures (5)

Figure 1: The architecture of S$^3$-MonoDETR, where S$^3$-DA with an MSM loss is introduced to generate a shape$\&$scale-aware filter to help the object queries yield more robust query features.
Figure 2: The specific design of S$^3$-DA, which is composed of three steps: (a) query-specific diverse local feature extraction, (b) visual$\&$depth-guided matching distribution prediction, and (c) object query updating.
Figure 3: The process of category label generation for the MSM loss.
Figure 4: The shape and scale distribution of "Car", "Pedestrian", and "Cyclist" in the camera view.
Figure 5: Four representative examples comparing the detection results of MonoDETR (left) and S$^3$-MonoDETR (right), where the red circles indicate missed or falsely detected objects. Box colors: "Car" (dark blue), "Pedestrian" (light blue), "Cyclist" (yellow).
