CT3D++: Improving 3D Object Detection with Keypoint-induced Channel-wise Transformer

Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Qiao Liang, Min-Jian Zhao, Jieping Ye

TL;DR

The paper tackles robust, scalable 3D object detection from point clouds by introducing CT3D, a two-stage framework that refines RPN proposals using a raw-point embedding, a Transformer encoder, and a channel-wise decoder. Building on CT3D, CT3D++ integrates geometric-semantic fusion and a novel Point-to-Key Bidirectional Cross-Attention (PBC) to efficiently model proposal-aware features with reduced computation. The authors demonstrate state-of-the-art performance on Waymo and KITTI, and provide extensive ablations showing the benefits of BEV semantic fusion, category-aware sampling, and the PBC scheme. The work offers flexible, hardware-efficient refiners that can plug into existing RPNs, with practical impact for autonomous driving and robotics where fast, accurate 3D understanding is essential.

Abstract

The field of 3D object detection from point clouds is rapidly advancing in computer vision, aiming to accurately and efficiently detect and localize objects in three-dimensional space. Current 3D detectors commonly fall short in terms of flexibility and scalability, with ample room for advancements in performance. In this paper, our objective is to address these limitations by introducing two frameworks for 3D object detection with minimal hand-crafted design. Firstly, we propose CT3D, which sequentially performs raw-point-based embedding, a standard Transformer encoder, and a channel-wise decoder for point features within each proposal. Secondly, we present an enhanced network called CT3D++, which incorporates geometric and semantic fusion-based embedding to extract more valuable and comprehensive proposal-aware information. Additionally, CT3D++ utilizes a point-to-key bidirectional encoder for more efficient feature encoding with reduced computational cost. By replacing the corresponding components of CT3D with these novel modules, CT3D++ achieves state-of-the-art performance on both the KITTI dataset and the large-scale Waymo Open Dataset. The source code for our frameworks will be made accessible at https://github.com/hlsheng1/CT3D-plusplus.
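To make the "reduced computational cost" claim concrete, here is a minimal NumPy sketch of a generic bidirectional cross-attention between N proposal points and a small set of M keys. It is not the authors' implementation: the function names, the two-step order (keys read from points, then points read from keys), the residual updates, and the omission of learned projections are all assumptions made for illustration. It only shows why routing attention through M keys needs about 2·N·M attention scores instead of the N² of full self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Single-head scaled dot-product attention with no learned projections
    # (a simplification assumed for this sketch).
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (num_queries, num_context)
    return softmax(scores, axis=-1) @ context   # (num_queries, d)

def point_to_key_bidirectional(points, keys):
    # Hypothetical two-step scheme, not taken from the paper:
    # step 1: every key gathers information from all points;
    keys = keys + cross_attention(keys, points)        # (M, d)
    # step 2: every point reads back from the updated keys.
    points = points + cross_attention(points, keys)    # (N, d)
    return points, keys

rng = np.random.default_rng(0)
N, M, d = 256, 8, 16                # illustrative sizes only
points = rng.normal(size=(N, d))
keys = rng.normal(size=(M, d))
new_points, new_keys = point_to_key_bidirectional(points, keys)

# Number of attention scores computed per layer.
full_self_attention_scores = N * N   # 65536
bidirectional_scores = 2 * N * M     # 4096
```

With these illustrative sizes the key-routed variant computes 16 times fewer attention scores than point-to-point self-attention, and the gap widens as N grows while M stays fixed.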

Paper Structure (44 sections, 12 equations, 8 figures, 11 tables)

Figures (8)

  • Figure 1: General point cloud-based 3D object detection framework. We advocate designing a flexible architecture with a replaceable RPN.
  • Figure 2: The overall framework of our proposed CT3D. First, CT3D uses an arbitrary RPN to generate coarse 3D proposals. Then, the raw points are gathered and processed by the proposed raw-point-based embedding module. Afterwards, the encoded point features are transformed into an effective proposal feature representation by a three-layer standard Transformer encoder and a novel channel-wise decoder. Here, $\mathbf{K}$ and $\mathbf{V}$ are obtained by linear projection from $\mathbf{X}$. $\mathbf{Q}$ and $\mathbf{W}$ are learnable parameters.
  • Figure 3: Illustration of the different decoding schemes: (a) Standard decoding; (b) Channel-wise re-weighting; (c) Extended channel-wise re-weighting.
  • Figure 4: Failure case analysis of Voxel R-CNN [deng2021voxel] and our proposed CT3D. The predicted and ground-truth bounding boxes are shown in green and red, respectively. Voxel R-CNN and CT3D produce biased bounding boxes and wrong confidence estimates, respectively, whereas our newly proposed CT3D++ performs well on these cases.
  • Figure 5: The overall framework of our proposed CT3D++. First, CT3D++ uses an arbitrary RPN to generate coarse 3D proposals and a latent feature map. Then, the raw points and BEV features are gathered based on the 3D proposals using the proposed geometric and semantic fusion embedding module. After that, an efficient and low-cost point-to-key bidirectional cross-attention scheme is proposed to improve the point features and assign more attention to the foreground points. Finally, the encoded point features are decoded and used to predict the refined 3D object detection results.
  • ...and 3 more figures
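
The three decoding schemes named in Figure 3 can be sketched in NumPy as follows. This is a hedged illustration under stated assumptions, not the authors' code: the captions only say that $\mathbf{K}$ and $\mathbf{V}$ are linear projections of $\mathbf{X}$ and that $\mathbf{Q}$ and $\mathbf{W}$ are learnable. The exact formulas, the treatment of $\mathbf{W}$ as a vector `w_mix` that mixes per-channel weights into one weight per point, and all function and variable names are assumptions of this sketch.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_decoding(q, K, V):
    # One weight per point from the query-key dot product.
    d = K.shape[1]
    weights = softmax(q @ K.T / np.sqrt(d), axis=-1)     # (1, N)
    return weights @ V                                    # (1, d)

def channel_wise_decoding(w_mix, K, V):
    # Softmax over the points separately for each key channel,
    # then mix the d per-channel weight rows into one row.
    d = K.shape[1]
    per_channel = softmax(K.T / np.sqrt(d), axis=-1)     # (d, N)
    weights = w_mix @ per_channel                         # (1, N)
    return weights @ V                                    # (1, d)

def extended_channel_wise_decoding(q, w_mix, K, V):
    # Scale each key channel by the global query-key score before
    # the per-channel softmax, combining both views.
    d = K.shape[1]
    global_scores = q @ K.T                               # (1, N)
    mixed = global_scores * K.T / np.sqrt(d)              # (d, N) via broadcasting
    weights = w_mix @ softmax(mixed, axis=-1)             # (1, N)
    return weights @ V                                    # (1, d)

rng = np.random.default_rng(0)
N, d = 64, 8                                 # illustrative sizes only
X = rng.normal(size=(N, d))                  # encoded point features of one proposal
W_k = rng.normal(size=(d, d)) / np.sqrt(d)   # hypothetical projection weights
W_v = rng.normal(size=(d, d)) / np.sqrt(d)
K, V = X @ W_k, X @ W_v                      # linear projections of X
q = rng.normal(size=(1, d))                  # learnable query (random stand-in)
w_mix = np.full((1, d), 1.0 / d)             # learnable mixing vector (uniform stand-in)

out_standard = standard_decoding(q, K, V)
out_channel = channel_wise_decoding(w_mix, K, V)
out_extended = extended_channel_wise_decoding(q, w_mix, K, V)
```

Each variant pools the N point features of a proposal into a single d-dimensional vector; they differ only in how the per-point weights are formed.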