Co-Fix3D: Enhancing 3D Object Detection with Collaborative Refinement

Wenxuan Li, Qin Zou, Chi Chen, Bo Du, Long Chen, Jian Zhou, Hongkai Yu

TL;DR

Co-Fix3D integrates Local and Global Enhancement modules to refine Bird's Eye View (BEV) features and adopts multi-head LGE modules, enabling each module to focus on targets with different levels of detection complexity, thus further enhancing overall perception capability.

Abstract

3D object detection in driving scenarios faces the challenge of complex road environments, which can lead to the loss or incompleteness of key features, thereby affecting perception performance. To address this issue, we propose an advanced detection framework called Co-Fix3D. Co-Fix3D integrates Local and Global Enhancement (LGE) modules to refine Bird's Eye View (BEV) features. The LGE module uses Discrete Wavelet Transform (DWT) for pixel-level local optimization and incorporates an attention mechanism for global optimization. To handle varying detection difficulties, we adopt multi-head LGE modules, enabling each module to focus on targets with different levels of detection complexity, thus further enhancing overall perception capability. Experimental results show that on the nuScenes dataset's LiDAR benchmark, Co-Fix3D achieves 69.4% mAP and 73.5% NDS, while on the multimodal benchmark, it achieves 72.3% mAP and 74.7% NDS. The source code is publicly available at https://github.com/rubbish001/Co-Fix3d.
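
To make the "pixel-level local optimization" step more concrete, the following is a minimal sketch of a single-level 2D Haar DWT applied to one channel of a feature map, assuming NumPy. The abstract does not state which wavelet Co-Fix3D uses or how the subbands are processed inside the LGE module, so the Haar choice and the function names `haar_dwt2` and `haar_idwt2` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar DWT of a 2D array with even height and width.

    Returns four half-resolution subbands: one approximation band and
    three detail bands. Subband naming conventions vary between libraries.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0        # local average (low-pass)
    detail_cols = (a - b + c - d) / 2.0   # change across columns
    detail_rows = (a + b - c - d) / 2.0   # change across rows
    detail_diag = (a - b - c + d) / 2.0   # diagonal change
    return approx, detail_cols, detail_rows, detail_diag


def haar_idwt2(approx, detail_cols, detail_rows, detail_diag):
    """Inverse of haar_dwt2: rebuilds the full-resolution array."""
    h, w = approx.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (approx + detail_cols + detail_rows + detail_diag) / 2.0
    x[0::2, 1::2] = (approx - detail_cols + detail_rows - detail_diag) / 2.0
    x[1::2, 0::2] = (approx + detail_cols - detail_rows - detail_diag) / 2.0
    x[1::2, 1::2] = (approx - detail_cols - detail_rows + detail_diag) / 2.0
    return x


# Hypothetical 4x4 single-channel BEV feature map.
bev = np.arange(16, dtype=float).reshape(4, 4)
bands = haar_dwt2(bev)
restored = haar_idwt2(*bands)
```

The transform is invertible, so a module can modify the subbands at half resolution and then reconstruct a full-resolution feature map without losing the untouched information.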

Paper Structure (23 sections, 3 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Comparing TransFusion, FocalFormer3D, and our proposed Co-Fix3D. (a) TransFusion utilizes BEV features to generate heatmap score maps, selecting the highest-scoring K cells as queries. The quality of these queries directly impacts TransFusion's performance. (b) FocalFormer3D uses a multi-stage approach with masking techniques to filter out easily detectable targets early. This increases queries and boosts detection by improving recall rates. (c) Co-Fix3D enhances BEV features with the LGE module and uses a parallel structure to adaptively process sample features across stages. This method generates high-quality query sets for accurate predictions in the Transformer module.
  • Figure 2: Overview of Co-Fix3D. Raw point cloud data is processed through a 3D Backbone network to generate LiDAR BEV features, while image data is processed through a 2D network and LSS to produce image BEV features. These features are fused into a new BEV representation using a Reduce Conv module and subsequently optimized through a multi-stage process that uses the Local and Global Enhancement (LGE) module. At each stage, a top-k strategy selects the highest-scoring queries for that stage, with a mask applied to prevent overlap between the selected queries and those from the previous stage. Finally, the $K\times N$ candidates are decoded to produce the detection outputs.
  • Figure 3: Details of the LGE Module.
  • Figure 4: The set of variants with different types of encoders.
  • Figure 5: The impact of LGE on features. By comparing (a) and (b), we found that the features within the red area in (b) are significantly better than those in (a).
  • ...and 1 more figures
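
The stage-wise query selection described in the Figure 2 caption can be sketched as follows, assuming NumPy. This is only an illustration under stated assumptions: each stage is assumed to produce its own BEV score map, the mask is assumed to block exactly the cells already chosen (the caption does not say whether a larger neighborhood is masked), and the function name `multi_stage_topk` is hypothetical.

```python
import numpy as np


def multi_stage_topk(heatmaps, k):
    """Pick k BEV cells per stage, never reusing a cell from an earlier stage.

    heatmaps: list of (H, W) score maps, one per stage (assumed input).
    Returns a list of (k, 2) integer arrays of (row, col) coordinates,
    ordered from highest to lowest score within each stage.
    """
    h, w = heatmaps[0].shape
    taken = np.zeros(h * w, dtype=bool)  # mask of cells already selected
    selected = []
    for heat in heatmaps:
        scores = heat.reshape(-1).astype(float)  # astype returns a copy
        scores[taken] = -np.inf                  # exclude earlier picks
        idx = np.argpartition(-scores, k - 1)[:k]  # unordered top-k
        idx = idx[np.argsort(-scores[idx])]        # sort by score, descending
        taken[idx] = True
        selected.append(np.stack(np.unravel_index(idx, (h, w)), axis=1))
    return selected


# Two hypothetical stages over a 2x2 BEV grid, k = 2 queries per stage.
stage1 = np.array([[0.9, 0.1], [0.8, 0.2]])
stage2 = np.array([[0.95, 0.3], [0.7, 0.6]])
queries = multi_stage_topk([stage1, stage2], k=2)
```

In this toy example the second stage scores cell (0, 0) highest, but the mask forces it to pick the two remaining cells instead, so the stages together contribute distinct candidates for decoding.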