CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow

Chenbin Pan; Burhaneddin Yaman; Senem Velipasalar; Liu Ren

TL;DR

This work introduces CLIP-BEVFormer, which uses contrastive learning to enhance multi-view image-derived BEV backbones with ground truth information flow, and achieves significant and consistent improvements over the SOTA.

Abstract

Autonomous driving stands as a pivotal domain in computer vision, shaping the future of transportation. Within this paradigm, the backbone of the system plays a crucial role in interpreting the complex environment. However, a notable challenge has been the loss of clear supervision when it comes to Bird's Eye View elements. To address this limitation, we introduce CLIP-BEVFormer, a novel approach that leverages the power of contrastive learning techniques to enhance multi-view image-derived BEV backbones with ground truth information flow. We conduct extensive experiments on the challenging nuScenes dataset and showcase significant and consistent improvements over the SOTA. Specifically, CLIP-BEVFormer achieves an impressive 8.5% and 9.2% enhancement in terms of NDS and mAP, respectively, over the previous best BEV model on the 3D object detection task.

Paper Structure (23 sections, 6 equations, 4 figures, 7 tables)


Introduction
Related Works
Bird's Eye View Feature Generation
Vision-Language Models
Contrastive Learning in Computer Vision
Methodology
Preliminary
Ground Truth BEV
Ground Truth Query Interaction
Loss
Experiments
Implementation Details
3D Detection Results
Long-tail Detection Results
Robustness Results
...and 8 more sections

Figures (4)

Figure 1: The overview of CLIP-BEVFormer. The architecture integrates two key modules: the Ground Truth BEV (GT-BEV) module employs a contrastive learning framework inspired by CLIP to enrich the quality of BEV representations, while the Ground Truth Query Interaction (GT-QI) module introduces ground truth flow guidance into perception decoding processes. This integration leads to superior 3D object detection performance, as demonstrated in our extensive experiments on the challenging nuScenes dataset.
Figure 2: Visualization results on nuScenes validation set. We demonstrate qualitative detection performance on both camera and BEV images. As can be seen in BEV images, CLIP-BEVFormer demonstrates improved alignment with ground truth detections.
Figure 3: Visualization results on nuScenes validation set. We demonstrate qualitative detection performance on both camera and BEV images. As can be seen in BEV images, our CLIP-BEVFormer method demonstrates improved alignment with ground truth detections.
Figure 4: Visualization results on nuScenes validation set. Our CLIP-BEVFormer demonstrates improved alignment with ground truth detections on both camera and BEV images.
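
The Figure 1 caption describes the GT-BEV module as a contrastive learning framework inspired by CLIP. As a rough illustration of what a CLIP-style objective looks like in general, the sketch below computes a symmetric contrastive (InfoNCE) loss over paired embeddings. It is not the paper's implementation: the function names, the pairing of BEV embeddings with ground truth embeddings, and the temperature value of 0.07 are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(bev_embeds, gt_embeds, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over N paired embeddings.

    Assumption: bev_embeds[i] and gt_embeds[i] describe the same object,
    and every other pairing in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in bev_embeds]
    b = [l2_normalize(v) for v in gt_embeds]
    # Cosine-similarity matrix scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b]
              for u in a]
    # Transpose to score the reverse direction (ground truth -> BEV).
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits_t))


# Toy check: correctly paired embeddings should give a much lower loss
# than embeddings whose pairing has been swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(aligned, aligned)
loss_swapped = symmetric_contrastive_loss(aligned, swapped)
```

In this toy setup the loss is close to zero when each embedding matches its own partner and large when the partners are swapped, which is the behavior a contrastive objective relies on to pull matching pairs together and push mismatched pairs apart.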
