What Really Matters for Learning-based LiDAR-Camera Calibration

Shujuan Huang, Chunyu Lin, Yao Zhao

TL;DR

This paper addresses the practical gap in learning-based LiDAR-Camera calibration by analyzing why regression-based methods underperform in real-world, cross-sensor scenarios. It shows that such methods often memorize depth-map distributions rather than establish true cross-modality correspondences, a flaw exposed through a minimal regression framework and extensive ablations. The study critiques common data-generation pipelines, demonstrating that synthetic perturbations and preprocessing biases limit generalization, as evidenced by cross-camera and KITTI-360 experiments. The authors advocate a shift toward matching-based calibration with explicit geometric constraints, arguing this approach better preserves cross-modal relationships and generalizes to unseen sensor configurations, thus advancing practical online calibration for robust multi-sensor fusion.

Abstract

Calibration is an essential prerequisite for the accurate data fusion of LiDAR and camera sensors. Traditional calibration techniques often require specific targets or suitable scenes to obtain reliable 2D-3D correspondences. To tackle the challenge of target-less and online calibration, deep neural networks have been introduced to solve the problem in a data-driven manner. While previous learning-based methods have achieved impressive performance on specific datasets, they still struggle in complex real-world scenarios. Most existing works focus on improving calibration accuracy but overlook the underlying mechanisms. In this paper, we revisit the development of learning-based LiDAR-Camera calibration and encourage the community to pay more attention to the underlying principles to advance practical applications. We systematically analyze the paradigm of mainstream learning-based methods, and identify the critical limitations of regression-based methods with the widely used data generation pipeline. Our findings reveal that most learning-based methods inadvertently operate as retrieval networks, focusing more on single-modality distributions rather than cross-modality correspondences. We also investigate how the input data format and preprocessing operations impact network performance and summarize the regression clues to inform further improvements.

Paper Structure

This paper contains 11 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The transformation between a point cloud captured from LiDAR and an image captured by camera. The target of LiDAR-Camera calibration is to estimate the coordinate transformation.
  • Figure 2: Two mainstream learning-based LiDAR-Camera calibration frameworks, distinguished by the output format of the network. Regression-based methods predict the extrinsic parameters directly in an end-to-end manner. Matching-based methods first predict correspondences and then solve for the extrinsics with an explicit geometric solver. (a) The regression-based paradigm. (b) The matching-based paradigm.
  • Figure 3: Rotation and translation error distributions of the cross-camera test, which are derived from the single-branch test (±0.5m / ±5°) with only depth as input, show minor variations for most components, except for the x-axis translation.
  • Figure 4: The classification accuracy comparisons of the cross-camera test, which are derived from the dual-branch test (±1.5m / ±20°), show a significant performance drop in x-axis translation. After regenerating labels using ground truth matrices multiplied by the transformation from the left camera to the right camera, the accuracy can be restored to the previous level.
  • Figure 5: The working principles of regression-based methods are illustrated. In the training stage, the network learns to memorize the mapping relationship between depth map distributions and extrinsic parameters. In the test stage, the network predicts the extrinsics based on the similarity between training data and test samples.
  • ...and 1 more figure
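
The coordinate transformation in Figure 1 and the perturbation ranges quoted in Figures 3 and 4 (for example ±1.5m / ±20°) can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: it assumes a pinhole camera with intrinsics `K`, a 4x4 LiDAR-to-camera extrinsic, and the commonly used scheme in which a random rigid perturbation is applied to the ground-truth extrinsic and the network is trained to regress its inverse. The function names, the Euler-angle convention, and the toy point cloud are all hypothetical.

```python
import numpy as np

def euler_to_matrix(rx, ry, rz):
    """Rotation matrix from Euler angles in radians, composed as Rz @ Ry @ Rx."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def random_perturbation(max_rot_deg, max_trans_m, rng):
    """Sample a 4x4 rigid perturbation, uniform per axis (an assumed scheme)."""
    angles = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg, size=3))
    T = np.eye(4)
    T[:3, :3] = euler_to_matrix(*angles)
    T[:3, 3] = rng.uniform(-max_trans_m, max_trans_m, size=3)
    return T

def project_to_depth(points_lidar, T_cam_lidar, K, height, width):
    """Project Nx3 LiDAR points into a sparse depth map of shape (height, width)."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0]          # keep points in front of the camera
    uvw = (K @ pts_cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts_cam[:, 2]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]
    depth = np.zeros((height, width))
    order = np.argsort(-z)                        # far to near, so the nearest point is written last
    depth[v[order], u[order]] = z[order]
    return depth

# Toy setup: identity ground truth, hypothetical intrinsics, random points.
rng = np.random.default_rng(0)
T_gt = np.eye(4)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
points = rng.uniform([-10, -5, 2], [10, 5, 40], size=(2000, 3))

delta = random_perturbation(20.0, 1.5, rng)       # range matching the ±1.5m / ±20° setting
T_init = delta @ T_gt                             # mis-calibrated initial extrinsic
depth_miscalibrated = project_to_depth(points, T_init, K, 480, 640)
label = np.linalg.inv(delta)                      # regression target: label @ T_init == T_gt
```

Under these assumptions, each training sample pairs a mis-calibrated depth map with the correction that undoes the perturbation. Because the ground-truth extrinsic is fixed across a dataset, the depth-map distribution alone carries information about the label, which is consistent with the retrieval-like behavior the paper describes for regression-based methods.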