SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection

Qiu Zhou; Jinming Cao; Hanchao Leng; Yifang Yin; Yu Kun; Roger Zimmermann

TL;DR

SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection) leverages a 3D semantic-occupancy branch to improve the accuracy of 3D object detection. Combining the two tasks yields a more comprehensive perception of the 3D environment, which helps to build more robust autonomous driving systems.

Abstract

In the field of autonomous driving, accurate and comprehensive perception of the 3D environment is crucial. Bird's Eye View (BEV) based methods have emerged as a promising solution for 3D object detection using multi-view images as input. However, existing 3D object detection methods often ignore the physical context in the environment, such as sidewalk and vegetation, resulting in sub-optimal performance. In this paper, we propose a novel approach called SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection), which leverages a 3D semantic-occupancy branch to improve the accuracy of 3D object detection. In particular, the physical context modeled by semantic occupancy helps the detector perceive scenes more holistically. SOGDet is flexible to use and can be seamlessly integrated with most existing BEV-based methods. To evaluate its effectiveness, we apply this approach to several state-of-the-art baselines and conduct extensive experiments exclusively on the nuScenes dataset. Our results show that SOGDet consistently enhances the performance of three baseline methods in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP). This indicates that the combination of 3D object detection and 3D semantic occupancy leads to a more comprehensive perception of the 3D environment, thereby helping to build more robust autonomous driving systems. The code is available at: https://github.com/zhouqiu/SOGDet.
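For readers unfamiliar with the NDS metric named in the abstract, the following is a minimal sketch of how it combines mAP with the five true-positive error terms, following the nuScenes benchmark definition (the formula is not restated in this abstract). The function name, the dictionary keys, and the sample values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
# Minimal sketch of the nuScenes Detection Score (NDS), assuming the
# benchmark's published definition:
#   NDS = (1/10) * [5 * mAP + sum over TP errors of (1 - min(1, error))]
# Names below are illustrative, not taken from the SOGDet code base.

TP_ERROR_NAMES = ("mATE", "mASE", "mAOE", "mAVE", "mAAE")


def nuscenes_detection_score(mean_ap, tp_errors):
    """Combine mAP and the five mean true-positive errors into one score."""
    missing = [name for name in TP_ERROR_NAMES if name not in tp_errors]
    if missing:
        raise ValueError(f"missing TP error terms: {missing}")
    # Each error is clipped at 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, tp_errors[name]) for name in TP_ERROR_NAMES]
    # mAP carries half of the total weight; the five TP scores share the rest.
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    example_errors = {"mATE": 0.6, "mASE": 0.3, "mAOE": 0.5, "mAVE": 0.9, "mAAE": 0.2}
    print(nuscenes_detection_score(0.4, example_errors))
```

Because mAP carries half of the weight, a gain in mAP moves NDS more than an equal reduction in any single error term, which is one reason the two metrics are usually reported together.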

Paper Structure (28 sections, 9 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Related Work
3D Object Detection (OD)
3D Semantic Occupancy Prediction (OC)
Method
Overall Architecture and Notations
Image Backbone
View Transformer
Depth-aware Multi-view Fusion.
4D Temporal Fusion.
Task Stage
Modality-fusion Module.
Occupancy Label Generation
Training Objectives
Losses of OD Branch.
...and 13 more sections

Figures (5)

Figure 1: Illustration of 3D object detection and semantic occupancy prediction tasks. In the rightmost legend, the top 10 categories in the blue box are shared by both tasks, and the bottom 6 categories in the green box are used exclusively by semantic occupancy prediction. (a) 3D object detection usually focuses on objects on roads, such as bicycles and cars. In contrast, (b) 3D semantic occupancy prediction is more concerned with physical contexts (e.g., sidewalk and vegetation) in the environment. By combining the two (c), we obtain a more comprehensive perception of traffic conditions, such as pedestrians and bicycles appearing mainly on the sidewalk and cars and buses co-appearing on the drivable surface.
Figure 2: The overall network architecture. Our approach includes an image backbone that encodes multi-view input images into vision features, a view transformer that transforms the vision features into BEV features, and a task stage comprising OD and OC branches that predict the OD and OC outputs, respectively, at the same time.
Figure 3: Illustration of the two types of labels.
Figure 4: Visualization for the OD and OC branches of SOGDet. The input consists of six multi-view images. For both the output and the GT (red box) column, from top to bottom, we sequentially show the predictions of SOGDet-SE for OD, SOGDet-SE for OC and SOGDet-BO for OC. The Hybrid feature is blended from OD and OC branch predictions of SOGDet-SE.
Figure 5: Parameter count (Param.) and floating-point operations (FLOPs).
