What Really Matters for Learning-based LiDAR-Camera Calibration
Shujuan Huang, Chunyu Lin, Yao Zhao
TL;DR
This paper addresses the practical gap in learning-based LiDAR-Camera calibration by analyzing why regression-based methods underperform in real-world, cross-sensor scenarios. It shows that such methods often memorize depth-map distributions rather than establish true cross-modality correspondences, a flaw exposed through a minimal regression framework and extensive ablations. The study critiques common data-generation pipelines, demonstrating that synthetic perturbations and preprocessing biases limit generalization, as evidenced by cross-camera and KITTI-360 experiments. The authors advocate a shift toward matching-based calibration with explicit geometric constraints, arguing this approach better preserves cross-modal relationships and generalizes to unseen sensor configurations, thus advancing practical online calibration for robust multi-sensor fusion.
Abstract
Calibration is an essential prerequisite for the accurate data fusion of LiDAR and camera sensors. Traditional calibration techniques often require specific targets or suitable scenes to obtain reliable 2D-3D correspondences. To tackle the challenge of target-less and online calibration, deep neural networks have been introduced to solve the problem in a data-driven manner. While previous learning-based methods have achieved impressive performance on specific datasets, they still struggle in complex real-world scenarios. Most existing works focus on improving calibration accuracy but overlook the underlying mechanisms. In this paper, we revisit the development of learning-based LiDAR-Camera calibration and encourage the community to pay more attention to the underlying principles to advance practical applications. We systematically analyze the paradigm of mainstream learning-based methods and identify the critical limitations of regression-based methods trained with the widely used data generation pipeline. Our findings reveal that most learning-based methods inadvertently operate as retrieval networks, focusing on single-modality distributions rather than cross-modality correspondences. We also investigate how the input data format and preprocessing operations impact network performance and summarize the regression clues to inform further improvements.
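For readers unfamiliar with the data generation pipeline the abstract refers to, the following is a minimal sketch of the synthetic-perturbation scheme commonly used to train regression-based calibration networks: a random rigid perturbation is applied to the ground-truth extrinsics, the point cloud is projected into a sparse depth map with the perturbed extrinsics, and the network is asked to regress the correction. This is an illustration under stated assumptions, not the authors' code. The function names (`random_perturbation`, `project_to_depth`, `make_sample`), the perturbation ranges, and the nearest-depth rule for pixel collisions are all hypothetical choices.

```python
import numpy as np


def random_perturbation(rng, max_rot_deg=20.0, max_trans_m=1.5):
    """Sample a 4x4 rigid transform from uniform Euler angles and translations.

    The default ranges are illustrative assumptions, not values from the paper.
    """
    angles = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg, size=3))
    cx, cy, cz = np.cos(angles)
    sx, sy, sz = np.sin(angles)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = rng.uniform(-max_trans_m, max_trans_m, size=3)
    return T


def project_to_depth(points_lidar, T_cam_lidar, K, height, width):
    """Project Nx3 LiDAR points into a sparse depth map (0 marks empty pixels)."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0]  # keep points in front of the camera
    uvw = (K @ pts_cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts_cam[:, 2]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.full((height, width), np.inf, dtype=np.float64)
    # When several points hit the same pixel, keep the nearest one.
    np.minimum.at(depth, (v[inside], u[inside]), z[inside])
    depth[np.isinf(depth)] = 0.0
    return depth


def make_sample(points_lidar, T_gt, K, height, width, rng):
    """Build one training sample: mis-calibrated depth map plus regression target."""
    dT = random_perturbation(rng)
    T_init = dT @ T_gt  # perturbed (mis-calibrated) extrinsics
    depth = project_to_depth(points_lidar, T_init, K, height, width)
    target = np.linalg.inv(dT)  # correction that maps T_init back to T_gt
    return depth, target, T_init
```

In this sketch the depth map changes with the sampled perturbation even when the RGB image is ignored, which is the kind of single-modality cue the abstract suggests a regression network can exploit in place of true cross-modality correspondence.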
