CalibFormer: A Transformer-based Automatic LiDAR-Camera Calibration Network

Yuxuan Xiao, Yao Li, Chengzhen Meng, Xingchen Li, Jianmin Ji, Yanyong Zhang

TL;DR

CalibFormer tackles LiDAR-camera extrinsic calibration in an online, targetless setting by learning to estimate the 6-DoF transformation between sensors from miscalibrated inputs. It fuses multi-layer camera and LiDAR features, uses a multi-head correlation module to capture cross-modality correspondences, and applies a Swin Transformer encoder plus Transformer decoder to regress translation and rotation deviations. A composite loss with translation, rotation, and point-cloud distance terms guides training, and training data is augmented with random extrinsic deviations to simulate drift. On KITTI, CalibFormer achieves state-of-the-art accuracy ($0.8751\,\mathrm{cm}$ mean translation error, $0.0562^{\circ}$ mean rotation error under challenging miscalibrations) and shows strong generalization to unseen scenes, at the cost of increased computation. The approach demonstrates robust, precise, and generalizable calibration suitable for online autonomous systems.

Abstract

The fusion of LiDARs and cameras has been increasingly adopted in autonomous driving for perception tasks. The performance of such fusion-based algorithms largely depends on the accuracy of sensor calibration, which is challenging due to the difficulty of identifying common features across different data modalities. Previously, many calibration methods involved specific targets and/or manual intervention, which has proven to be cumbersome and costly. Learning-based online calibration methods have been proposed, but their performance is barely satisfactory in most cases. These methods usually suffer from issues such as sparse feature maps, unreliable cross-modality association, inaccurate calibration parameter regression, etc. In this paper, to address these issues, we propose CalibFormer, an end-to-end network for automatic LiDAR-camera calibration. We aggregate multiple layers of camera and LiDAR image features to achieve high-resolution representations. A multi-head correlation module is utilized to identify correlations between features more accurately. Lastly, we employ transformer architectures to estimate accurate calibration parameters from the correlation information. Our method achieved a mean translation error of $0.8751 \mathrm{cm}$ and a mean rotation error of $0.0562 ^{\circ}$ on the KITTI dataset, surpassing existing state-of-the-art methods and demonstrating strong robustness, accuracy, and generalization capabilities.

Paper Structure

This paper contains 19 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The overview of our proposed method for camera and LiDAR calibration. Firstly, we project the LiDAR point cloud onto the image plane, generating a miscalibrated LiDAR image using the initial extrinsic parameter $\mathbf{T}_{init}$ and the camera matrix $\mathbf{K}$. Our network takes both camera images and LiDAR images as inputs. After extracting fine-grained features, we employ a multi-head correlation module and a transformer architecture to obtain a 6-DoF transformation $\mathbf{T}_{pred}$ representing the deviation between the initial extrinsic parameter $\mathbf{T}_{init}$ and the accurate extrinsic parameter $\mathbf{T}_{LC}$.
  • Figure 2: Overview of deep layer aggregation. After obtaining the features generated by the backbone at different layers, these features are respectively upsampled and aggregated to obtain a high-resolution feature map.
  • Figure 3: Examples of calibration results for different scenes on the KITTI dataset. (a) represents the projection of miscalibrated point clouds onto the image plane. (b) shows the projection result of the point cloud using the network's predicted extrinsic parameters, and (c) represents the corresponding ground truth result.
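
The projection step described in the Figure 1 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `project_lidar_to_image`, the depth-only LiDAR image, and the toy intrinsics are all assumptions made for the example. In particular, the convention $\mathbf{T}_{init} = \Delta\mathbf{T}\,\mathbf{T}_{LC}$ used to relate the initial and accurate extrinsics is assumed here; the text above only says that $\mathbf{T}_{pred}$ represents the deviation between them.

```python
import numpy as np

def project_lidar_to_image(points, T, K, image_size):
    """Project (N, 3) LiDAR points into a sparse depth image (hypothetical helper).

    points:     (N, 3) coordinates in the LiDAR frame
    T:          (4, 4) LiDAR-to-camera extrinsic matrix
    K:          (3, 3) camera intrinsic matrix
    image_size: (height, width) of the output image
    """
    h, w = image_size
    # Move the points into the camera frame using homogeneous coordinates.
    homo = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = (T @ homo.T).T[:, :3]
    # Discard points behind the camera.
    cam = cam[cam[:, 2] > 0]
    # Pinhole projection to pixel coordinates.
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    depth = cam[:, 2]
    # Keep only pixels that fall inside the image.
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, depth = u[inside], v[inside], depth[inside]
    # Write far points first so the nearest point wins at shared pixels.
    order = np.argsort(-depth)
    image = np.zeros((h, w), dtype=np.float32)
    image[v[order], u[order]] = depth[order]
    return image

# Toy setup (all values are illustrative assumptions).
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
T_LC = np.eye(4)                       # accurate extrinsic
delta = np.eye(4)
delta[0, 3] = 1.0                      # 1 m deviation along the camera x-axis
T_init = delta @ T_LC                  # assumed convention: T_init = delta @ T_LC

points = np.array([[0.0, 0.0, 10.0],
                   [1.0, 0.0, 10.0],
                   [0.0, 0.0, -5.0]])  # the last point is behind the camera

calibrated = project_lidar_to_image(points, T_LC, K, (100, 100))
miscalibrated = project_lidar_to_image(points, T_init, K, (100, 100))

# If a network predicted the deviation exactly, inverting it would undo the error.
T_recovered = np.linalg.inv(delta) @ T_init
```

Under this assumed convention, the miscalibrated image is a shifted copy of the calibrated one, and applying the inverse of the deviation to `T_init` returns `T_LC`. A real pipeline would use the KITTI calibration files for `K` and the ground-truth extrinsic, and would sample rotational as well as translational deviations.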