VaLID: Verification as Late Integration of Detections for LiDAR-Camera Fusion
Vanshika Vats, Marzia Binta Nizam, James Davis
TL;DR
VaLID addresses robust 3D vehicle detection by verifying LiDAR detections with camera detections in a late-fusion framework. A compact verification module outputs a probability $P_i = \sigma(f_{\theta}(I_i))$ for each LiDAR box and rescales the final confidence as $S'_L = S_L \cdot P_i$, biasing toward recall while filtering false positives. On KITTI, using LiDAR detectors PV-RCNN or TED and camera detectors MonoDETR, YOLO-NAS, or GroundingDINO, VaLID achieves substantial false-positive reductions (average around $63.9\%$) and delivers competitive 3D AP compared with state-of-the-art fusion methods. Importantly, the approach does not require dataset-specific fine-tuning and remains effective with general open-vocabulary detectors, supporting practical deployment across diverse sensing setups.
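The rescoring rule $S'_L = S_L \cdot P_i$ with $P_i = \sigma(f_{\theta}(I_i))$ can be illustrated with a minimal sketch. It assumes the verifier's raw outputs $f_{\theta}(I_i)$ are already available as one number per LiDAR box. The function names `sigmoid` and `rescale_confidences` and the list-based interface are hypothetical, chosen only for illustration, and are not taken from the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def rescale_confidences(lidar_scores, verifier_logits):
    """Apply S'_L = S_L * P_i with P_i = sigmoid(f_theta(I_i)).

    lidar_scores:    LiDAR detector confidences S_L, one per box.
    verifier_logits: assumed raw verifier outputs f_theta(I_i), one per box.
    """
    if len(lidar_scores) != len(verifier_logits):
        raise ValueError("need exactly one verifier output per LiDAR box")
    return [s * sigmoid(z) for s, z in zip(lidar_scores, verifier_logits)]


# Illustrative values only: the first box is strongly verified, the second is not.
rescaled = rescale_confidences([0.90, 0.85], [6.0, -6.0])
```

Under this rule a box the verifier trusts keeps nearly its original LiDAR confidence, while a box it rejects is pushed toward zero and falls below any reasonable score threshold without being explicitly deleted.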
Abstract
Vehicle detection benefits from both LiDAR and camera data, with LiDAR offering superior performance in many scenarios. Fusing these modalities further improves accuracy, but existing methods often introduce complexity or dataset-specific dependencies. We propose VaLID, a model-adaptive late-fusion method that verifies whether each predicted bounding box should be accepted. Our method verifies the detections of the higher-performing yet overly optimistic LiDAR model using camera detections obtained from specially trained, general, or open-vocabulary models. VaLID uses a lightweight neural verification network, trained with a high-recall bias, to reduce the false predictions made by the LiDAR detector while preserving the true ones. Evaluating multiple combinations of LiDAR and camera detectors on the KITTI dataset, we reduce false positives by an average of 63.9% and outperform the individual detectors on 3D average precision (3DAP). Our approach is model-adaptive and achieves performance competitive with the state of the art even when using generic camera detectors that were not trained specifically for this dataset.
