VoxelNextFusion: A Simple, Unified and Effective Voxel Fusion Framework for Multi-Modal 3D Object Detection

Ziying Song; Guoxin Zhang; Jun Xie; Lin Liu; Caiyan Jia; Shaoqing Xu; Zhepeng Wang

TL;DR

VoxelNextFusion presents a simple, unified voxel fusion framework for LiDAR-camera 3D object detection. It introduces Patch-Point Fusion (P$^2$-Fusion) to fuse patch-level image features with voxel features and Foreground-Background Fusion (FB-Fusion) to emphasize informative foreground regions, mitigating background interference. The approach yields consistent improvements on KITTI and nuScenes, particularly for long-range and sparse-point objects, and demonstrates robustness across multiple voxel-based baselines. By preserving image semantics and continuity while densifying sparse voxel representations, it advances multi-modal fusion in autonomous driving with strong practical impact. The method achieves notable gains in both 3D and BEV metrics, validating its effectiveness and generality.

Abstract

LiDAR-camera fusion can enhance the performance of 3D object detection by utilizing complementary information between depth-aware LiDAR points and semantically rich images. Existing voxel-based methods face significant challenges when fusing sparse voxel features with dense image features in a one-to-one manner: the advantages of images, including semantic and continuity information, are lost, leading to sub-optimal detection performance, especially at long distances. In this paper, we present VoxelNextFusion, a multi-modal 3D object detection framework specifically designed for voxel-based methods, which effectively bridges the gap between sparse point clouds and dense images. In particular, we propose a voxel-based image pipeline that projects point clouds onto images to obtain both pixel- and patch-level features. These features are then fused using self-attention to obtain a combined representation. Moreover, to address the issue of background features present in patches, we propose a feature importance module that effectively distinguishes between foreground and background features, thus minimizing the impact of the background features. Extensive experiments were conducted on the widely used KITTI and nuScenes 3D object detection benchmarks. Notably, our VoxelNextFusion achieved an improvement of around +3.20% in AP@0.7 for car detection at the hard level compared to the Voxel R-CNN baseline on the KITTI test dataset.
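
The pipeline described in the abstract (project points onto the image, gather pixel- and patch-level features, fuse them with attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`project_points`, `gather_pixel_and_patch`, `attention_fuse`), the 3 x 4 projection matrix `proj`, the patch size `k`, and the use of plain scaled dot-product attention with the pixel feature as the query are all assumptions made for illustration.

```python
import numpy as np

def project_points(points_xyz, proj):
    """Project N x 3 points to pixel coordinates using an assumed 3 x 4 matrix."""
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])  # N x 4 homogeneous points
    uvw = homo @ proj.T                              # N x 3
    return uvw[:, :2] / uvw[:, 2:3]                  # (u, v) per point

def gather_pixel_and_patch(feat_map, uv, k=3):
    """Gather one-to-one pixel features and one-to-many k x k patch features.

    feat_map: H x W x C image feature map.
    Returns (N x C pixel features, N x k*k x C patch features).
    """
    h, w, c = feat_map.shape
    r = k // 2
    # Edge padding keeps patches well defined at the image border.
    padded = np.pad(feat_map, ((r, r), (r, r), (0, 0)), mode="edge")
    cols = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    pixel = feat_map[rows, cols]
    patches = np.stack([
        padded[rr:rr + k, cc:cc + k].reshape(k * k, c)
        for rr, cc in zip(rows, cols)
    ])
    return pixel, patches

def attention_fuse(pixel, patches):
    """Fuse features with scaled dot-product attention (a stand-in, by assumption).

    The pixel feature queries its own patch; the attended patch context is
    concatenated to the pixel feature, giving an N x 2C output.
    """
    c = pixel.shape[1]
    scores = np.einsum("nc,npc->np", pixel, patches) / np.sqrt(c)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    context = np.einsum("np,npc->nc", weights, patches)
    return np.concatenate([pixel, context], axis=1)
```

With `k=3`, each projected point draws on nine image locations instead of one, which is the one-to-many idea; the centre entry of each patch is the same feature that the one-to-one projection would have used alone.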

Paper Structure (29 sections, 7 equations, 6 figures, 10 tables, 2 algorithms)

This paper contains 29 sections, 7 equations, 6 figures, 10 tables, 2 algorithms.

Introduction
Related work
3D Object Detection with Single Modality
3D Object Detection with Multi-modalities
VoxelNextFusion
Patch-Point Fusion
Projection
Fusion
Foreground-Background Fusion
Experiments
Dataset and Evaluation Metrics
KITTI dataset
nuScenes dataset
Implementation Details
Network Architecture
...and 14 more sections

Figures (6)

Figure 1: (a) To fuse point clouds and images accurately, state-of-the-art methods leverage one-to-one projection to match 3D and 2D coordinates. However, the two modalities have inconsistent resolutions: for instance, a long-range object such as the car marked in green contains 14 LiDAR points but more than 200 pixels. (b) To tackle this issue, we propose the VoxelNextFusion strategy, which combines the one-to-many and one-to-one approaches to make use of more pixels. (c) The experiments demonstrate that our VoxelNextFusion significantly improves detection performance, particularly for long-range objects.
Figure 2: Point Cloud Count Distribution by Difficulty Levels in KITTI GT Bounding Boxes. The data is sourced from the GT statistics of cars in the KITTI train dataset, comprising a total of 14,357 points. Among these, there are 3,153 points categorized as "easy," 4,893 points categorized as "moderate," and 2,971 points categorized as "hard." A lower point count within the GT bounding box indicates higher detection difficulty, with "Hard" cases being the most prevalent. As shown in the KITTI test results table, a 3.20% improvement on the "Hard" category demonstrates the effectiveness of our VoxelNextFusion.
Figure 3: The framework of our VoxelNextFusion. First, we voxelize the point cloud and feed it into a 3D sparse convolution backbone. In the image branch, the image is fed into a 2D encoder. We then project the sparse voxel features onto the image features to perform the P$^2$-Fusion (Patch-Point Fusion) module. Second, we adopt the FB-Fusion (Foreground-Background Fusion) module, which weights features according to their foreground or background scores. Finally, the weighted features are fed into a 3D convolution block and used to predict results. 'SAF' denotes the self-attention fusion module.
Figure 4: Comparison of one-to-one and one-to-many fusion. The green square represents the features of the projected pixels, the yellow square represents the features of the unprojected pixels, and the light green square represents the features of the neighboring pixels of the projected pixels.
Figure 5: Illustration of splitting foreground and background. We note that this is a 2D example and can be easily extended to 3D cases. By comparing $\mathbf{F_{imp}}$ with $\mathcal{T}$, we partition the features into Foreground features and Background features. To enhance the density of Foreground features, we use the 'EXPAND' operation to repeat each Foreground feature to its ${K_{S}}^{3} - 1$ surrounding neighbors. By comparing $\mathbf{F_{imp}}$ with $\mathcal{T}$ again, we discriminate between the Expanded Fore. and Expanded Back. features. Subsequently, we employ the 'DISCARD' operation to eliminate the Expanded Back. features.
...and 1 more figures
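
The split, 'EXPAND' and 'DISCARD' steps from the Figure 5 caption can be sketched in 2D as follows. This is an illustrative reading of the caption only, under several assumptions: the function name `fb_split_expand_discard` is hypothetical, each occupied cell is assumed to carry a `ks * ks` vector of importance scores (the centre entry for the cell itself, the others for its neighbours), original cells are assumed to be kept whether foreground or background, and score-based feature weighting is omitted. The caption does not specify how expanded positions obtain their scores.

```python
import numpy as np

def fb_split_expand_discard(coords, feats, imp, tau=0.5, ks=3):
    """2D sketch of foreground/background splitting with EXPAND and DISCARD.

    coords: N x 2 integer (row, col) positions of occupied cells.
    feats:  N x C features of those cells.
    imp:    N x ks*ks importance scores (assumed layout, see lead-in).
    tau:    threshold playing the role of T in the caption.
    Returns a dict mapping (row, col) -> feature vector.
    """
    r = ks // 2
    offsets = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    center = offsets.index((0, 0))

    # Assumption: every original cell is kept, foreground or background.
    out = {(y, x): f for (y, x), f in zip(coords.tolist(), feats)}

    for (y, x), f, s in zip(coords.tolist(), feats, imp):
        if s[center] <= tau:
            continue  # background cell: it is not expanded
        for j, (dy, dx) in enumerate(offsets):
            if j == center:
                continue
            # EXPAND repeats the foreground feature to a neighbour;
            # neighbours scoring at or below tau are DISCARDed.
            if s[j] > tau:
                out.setdefault((y + dy, x + dx), f)
    return out
```

In 2D each foreground cell can add at most `ks * ks - 1` neighbours; the 3D case in the caption has ${K_{S}}^{3} - 1$ candidates per foreground voxel.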
