CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow

Chenbin Pan, Burhaneddin Yaman, Senem Velipasalar, Liu Ren

TL;DR

This work introduces CLIP-BEVFormer, which uses contrastive learning to bring ground truth information flow into multi-view image-derived BEV backbones, and reports significant and consistent improvements over the state of the art (SOTA).

Abstract

Autonomous driving stands as a pivotal domain in computer vision, shaping the future of transportation. Within this paradigm, the backbone of the system plays a crucial role in interpreting the complex environment. However, a notable challenge has been the loss of clear supervision when it comes to Bird's Eye View elements. To address this limitation, we introduce CLIP-BEVFormer, a novel approach that leverages the power of contrastive learning techniques to enhance the multi-view image-derived BEV backbones with ground truth information flow. We conduct extensive experiments on the challenging nuScenes dataset and showcase significant and consistent improvements over the SOTA. Specifically, CLIP-BEVFormer achieves an impressive 8.5% and 9.2% enhancement in terms of NDS and mAP, respectively, over the previous best BEV model on the 3D object detection task.
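
The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of a generic CLIP-style symmetric contrastive (InfoNCE) loss, assuming that image-derived BEV embeddings and ground truth embeddings come in matched pairs. It is not the authors' formulation: the function names, the embedding shapes, and the temperature of 0.07 are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_loss(bev_emb, gt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over N paired embeddings of shape (N, D).

    Row i of bev_emb is assumed to match row i of gt_emb (hypothetical setup).
    """
    bev = l2_normalize(bev_emb)
    gt = l2_normalize(gt_emb)
    logits = bev @ gt.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    # Cross-entropy with the matching pair as the target, in both directions.
    loss_bev_to_gt = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_gt_to_bev = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_bev_to_gt + loss_gt_to_bev)


# Toy data: embeddings close to their ground truth versus a mismatched pairing.
rng = np.random.default_rng(0)
gt = rng.normal(size=(8, 32))
aligned = gt + 0.01 * rng.normal(size=gt.shape)
shuffled = aligned[::-1].copy()  # reversed order, so no row matches its target
loss_aligned = clip_style_loss(aligned, gt)
loss_shuffled = clip_style_loss(shuffled, gt)
print(loss_aligned, loss_shuffled)
```

In this toy setup the loss is lower when each embedding sits next to its own ground truth counterpart than when the pairing is scrambled, which is the property a contrastive objective of this kind rewards.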

Paper Structure (23 sections, 6 equations, 4 figures, 7 tables)


Figures (4)

  • Figure 1: The overview of CLIP-BEVFormer. The architecture integrates two key modules: the Ground Truth BEV (GT-BEV) module employs a contrastive learning framework inspired by CLIP to enrich the quality of BEV representations, while the Ground Truth Query Interaction (GT-QI) module introduces ground truth flow guidance into perception decoding processes. This integration leads to superior 3D object detection performance, as demonstrated in our extensive experiments on the challenging nuScenes dataset.
  • Figure 2: Visualization results on the nuScenes validation set. We demonstrate qualitative detection performance on both camera and BEV images. As can be seen in the BEV images, CLIP-BEVFormer demonstrates improved alignment with ground truth detections.
  • Figure 3: Visualization results on the nuScenes validation set. We demonstrate qualitative detection performance on both camera and BEV images. As can be seen in the BEV images, our CLIP-BEVFormer method demonstrates improved alignment with ground truth detections.
  • Figure 4: Visualization results on the nuScenes validation set. Our CLIP-BEVFormer demonstrates improved alignment with ground truth detections on both camera and BEV images.