VaLID: Verification as Late Integration of Detections for LiDAR-Camera Fusion

Vanshika Vats, Marzia Binta Nizam, James Davis

TL;DR

VaLID targets robust 3D vehicle detection by verifying LiDAR detections against camera detections in a late-fusion framework. For each LiDAR box $i$, a compact verification module outputs an acceptance probability $P_i = \sigma(f_{\theta}(I_i))$, where $I_i$ is the input vector for that box, and the LiDAR confidence $S_L$ is rescaled as $S'_L = S_L \cdot P_i$. Training is biased toward recall so that true detections are preserved while false positives are filtered. On KITTI, with PV-RCNN or TED as the LiDAR detector and MonoDETR, YOLO-NAS, or GroundingDINO as the camera detector, VaLID reduces false positives by $63.9\%$ on average and delivers 3D AP competitive with state-of-the-art fusion methods. The approach does not require dataset-specific fine-tuning of the camera detector and remains effective with general open-vocabulary detectors, supporting practical deployment across diverse sensing setups.
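
The rescoring rule $S'_L = S_L \cdot P_i$ can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names `sigmoid` and `rescore` are hypothetical, and the `logits` argument stands in for the raw outputs $f_{\theta}(I_i)$ of the verification module.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rescore(lidar_scores: Sequence[float], logits: Sequence[float]) -> List[float]:
    """Scale each LiDAR confidence S_L by its acceptance probability P_i.

    `logits` is assumed to hold one verifier output per LiDAR box
    (hypothetical interface, not taken from the paper).
    """
    if len(lidar_scores) != len(logits):
        raise ValueError("need exactly one logit per LiDAR box")
    return [s * sigmoid(z) for s, z in zip(lidar_scores, logits)]


# Two boxes with equal LiDAR confidence: the verifier is confident about the
# first and doubtful about the second, so only the second is pushed down.
new_scores = rescore([0.9, 0.9], [6.0, -6.0])
```

A logit of 0 gives $P_i = 0.5$ and halves the LiDAR confidence, while a large positive logit leaves it almost unchanged, so doubtful boxes sink in the ranking instead of being dropped outright.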

Abstract

Vehicle object detection benefits from both LiDAR and camera data, with LiDAR offering superior performance in many scenarios. Fusion of these modalities further enhances accuracy, but existing methods often introduce complexity or dataset-specific dependencies. In our study, we propose a model-adaptive late-fusion method, VaLID, which validates whether each predicted bounding box is acceptable or not. Our method verifies the higher-performing, yet overly optimistic LiDAR model detections using camera detections that are obtained from either specially trained, general, or open-vocabulary models. VaLID uses a lightweight neural verification network trained with a high recall bias to reduce the false predictions made by the LiDAR detector, while still preserving the true ones. Evaluating with multiple combinations of LiDAR and camera detectors on the KITTI dataset, we reduce false positives by an average of 63.9%, thus outperforming the individual detectors on 3D average precision (3DAP). Our approach is model-adaptive and demonstrates state-of-the-art competitive performance even when using generic camera detectors that were not trained specifically for this dataset.
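
One common way to give a verifier a recall bias is a weighted binary cross-entropy, which the Figure 4 caption names as the training loss. The sketch below assumes the extra weight is placed on the positive class (LiDAR boxes that match ground truth); the parameter name `pos_weight` and its value of 2.0 are illustrative assumptions, not settings reported in the paper.

```python
import math
from typing import Sequence


def weighted_bce(
    probs: Sequence[float],
    labels: Sequence[int],
    pos_weight: float = 2.0,  # assumed value, chosen only for illustration
    eps: float = 1e-7,
) -> float:
    """Mean binary cross-entropy with an extra weight on positive labels.

    With pos_weight > 1, rejecting a true detection (label 1, low probability)
    costs more than accepting a false one, which pushes the model toward recall.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal in length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# Equally wrong predictions, but the missed true box is penalised more.
loss_missed_true = weighted_bce([0.1], [1])
loss_kept_false = weighted_bce([0.9], [0])
```

Under these assumptions the two errors are symmetric in probability, yet the missed true detection incurs `pos_weight` times the loss of the retained false positive.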

Paper Structure (16 sections, 5 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: We conduct our study on the KITTI dataset. The rows show (a) the official 2D ground truth, (b) detections from the specialized LiDAR model PV-RCNN, (c) detections from the specialized camera model MonoDETR, and (d) detections from the open-vocabulary model GroundingDINO. LiDAR generally produces too many false positives, seen most obviously in Column [i]. Camera models can help verify and reject false positives despite imperfections in their own detections (e.g., a missed detection of a partial car in Row [c], Col [ii], and imperfect dimensions in Row [d], Col [iii]).
  • Figure 2: Precision-recall curves (PRC) of our baseline single-modality models on the KITTI moderate-difficulty set. Notice that both LiDAR models outperform all three camera models and that the specialized camera detector outperforms the two general-purpose camera-based object detectors.
  • Figure 3: The distribution of false positives and true positives across confidence score bands for one LiDAR and two camera object detection models on the KITTI moderate set. Notice that the LiDAR method has false positives across all confidence bands, and that the camera methods have different and sometimes complementary distributions.
  • Figure 4: In our proposed method, VaLID, we take LiDAR ($b_L$) boxes as the primary detections and use the camera ($b_C$) modality to verify whether each $b_L$ is an acceptable detection or not. The input vector is represented as the bounding box dimensions of width $w$, height $h$, and center $(cx, cy)$, confidence scores $s$, and the measure of overlap $IoU$. We pass this vector through a neural verification module which outputs an acceptance probability as a sigmoid value. The training objective is defined based on the overlap between the ground truth ($b_{GT}$) with the given $b_L$, optimized using a weighted BCE loss to encourage high recall. During inference, the accepted box confidence is rescored by multiplying it with the output sigmoid probability in order to reduce the confidence of the lingering false positives.
  • Figure 5: Our fusion method significantly reduces false positives for (a) PV-RCNN and (b) TED detections. This improvement holds for all tested camera methods used for verification.
  • ...and 1 more figure
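
The Figure 4 caption describes the verifier input as box width, height, center, confidence scores, and an IoU overlap measure. A minimal sketch of assembling such a vector is given below. The field order, the choice of the camera box with the highest IoU as the match, and the zero padding when no camera box overlaps are all assumptions made for illustration; `iou_2d` and `build_input` are hypothetical names, and the LiDAR box is assumed to be already projected into the image plane.

```python
from typing import List, Optional, Sequence, Tuple

# A 2D box as (cx, cy, w, h, score) in image coordinates.
Box = Tuple[float, float, float, float, float]


def iou_2d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes in center format."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def build_input(lidar_box: Box, camera_boxes: Sequence[Box]) -> List[float]:
    """Concatenate LiDAR box, best-overlapping camera box, and their IoU.

    Returns 11 values: 5 LiDAR fields, 5 camera fields, 1 IoU.
    Camera fields are zero when no camera box overlaps (assumed convention).
    """
    best: Optional[Box] = None
    best_iou = 0.0
    for cam in camera_boxes:
        overlap = iou_2d(lidar_box, cam)
        if overlap > best_iou:
            best, best_iou = cam, overlap
    cam_fields = list(best) if best is not None else [0.0] * 5
    return list(lidar_box) + cam_fields + [best_iou]


# One projected LiDAR box and two camera detections; only the first overlaps.
vec = build_input((10.0, 10.0, 4.0, 4.0, 0.9),
                  [(10.0, 10.0, 4.0, 4.0, 0.7), (100.0, 100.0, 4.0, 4.0, 0.8)])
```

A vector of this kind would be passed to the verification module, whose sigmoid output then rescales the LiDAR confidence at inference time.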