Deep Learning for Camera Calibration and Beyond: A Survey

Kang Liao, Lang Nie, Shujuan Huang, Chunyu Lin, Jing Zhang, Yao Zhao, Moncef Gabbouj, Dacheng Tao

TL;DR

This survey provides a comprehensive overview of learning-based camera calibration, organizing methods into standard, distortion, cross-view, and cross-sensor models. It contrasts regression-based and reconstruction-based paradigms, discusses geometric priors, geometry fields, and joint calibration with NeRF-like techniques, and introduces a public, multi-faceted benchmark. Key contributions include a fine-grained taxonomy, critical discussion of datasets and evaluation metrics, and a forward-looking agenda highlighting priors, sequence modeling, and implicit representations. The work emphasizes practical impact for automatic, target-free calibration across diverse cameras and sensor combinations, with implications for autonomous systems and robust 3D vision. The accompanying open-source repository enables ongoing tracking of methods, datasets, and benchmarks.

Abstract

Camera calibration involves estimating camera parameters to infer geometric features from captured sequences, which is crucial for computer vision and robotics. However, conventional calibration is laborious and requires dedicated data collection. Recent efforts show that learning-based solutions have the potential to replace the repetitive work of manual calibration. Among these solutions, various learning strategies, networks, geometric priors, and datasets have been investigated. In this paper, we provide a comprehensive survey of learning-based camera calibration techniques by analyzing their strengths and limitations. Our main calibration categories include the standard pinhole camera model, distortion camera model, cross-view model, and cross-sensor model, following the research trend and extended applications. As there is no unified benchmark in this community, we collect a holistic calibration dataset that can serve as a public platform to evaluate the generalization of existing methods. It comprises both synthetic and real-world data, with images and videos captured by different cameras in diverse scenes. Toward the end of this paper, we discuss the challenges and provide further research directions. To our knowledge, this is the first survey of learning-based camera calibration (spanning 10 years). The summarized methods, datasets, and benchmarks are available and will be regularly updated at https://github.com/KangLiao929/Awesome-Deep-Camera-Calibration.

Paper Structure (48 sections, 7 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Popular calibration objectives, models, and extended applications in camera calibration.
  • Figure 2: The structural and hierarchical taxonomy of camera calibration with deep learning. Some classical methods are listed under each category.
  • Figure 3: Overview of CTRL-C. The figure is from CTRL-C. It estimates parameters including the zenith VP, FoV, and horizon line for camera calibration from an input image and a set of line segments. Moreover, two auxiliary outputs (vertical and horizontal convergence line scores) guide the network in learning scene geometry for calibration.
  • Figure 4: Three common learning solutions for regression-based wide-angle camera calibration: (a) SingleNet, (b) DualNet, (c) SeqNet, where $\mathbf{I}$ is the distorted image and $f$ and $\xi$ denote the focal length and distortion parameters, respectively. The figure is from DeepCalib.
  • Figure 5: Architecture of FE-GAN. The figure is from FE-GAN. The architecture consists of two components: a generator $G = (U, W)$ that rectifies the distorted image $x$, and a discriminator $D = (D_{adv}, D_{cls})$. The module $U$ in $G$ predicts the distortion flow $f = U(x)$, while $W$ rectifies the distorted image using $f$.
  • ...and 6 more figures
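
The Figure 5 caption describes a module $W$ that rectifies the distorted image using a predicted flow $f = U(x)$. The warping step alone can be sketched as below. This is a minimal illustration under stated assumptions, not FE-GAN's implementation: the function name `warp_with_flow`, the `(dx, dy)` flow layout, and the nearest-neighbour backward sampling are all hypothetical choices, and the flow here is supplied by hand rather than predicted by a network.

```python
import numpy as np


def warp_with_flow(image, flow):
    """Backward-warp an image with a per-pixel flow field (hypothetical helper).

    image: array of shape (H, W) or (H, W, C).
    flow:  array of shape (H, W, 2); flow[y, x] = (dx, dy) is the offset from
           output pixel (x, y) to the input location it is sampled from.
           This layout is an assumption made for the sketch.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]  # pixel grid of the output image

    # Source coordinates, rounded to the nearest pixel and clamped to the border.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)

    # Advanced indexing gathers one input pixel per output pixel.
    return image[src_y, src_x]


# Toy usage: a zero flow leaves the image unchanged, and a constant
# flow of dx = 1 reads every pixel from its right-hand neighbour.
image = np.arange(12).reshape(3, 4)
zero_flow = np.zeros((3, 4, 2))
shift_flow = np.zeros((3, 4, 2))
shift_flow[..., 0] = 1.0

identity = warp_with_flow(image, zero_flow)
shifted = warp_with_flow(image, shift_flow)
```

Nearest-neighbour sampling keeps the sketch short. A learned rectifier would more plausibly use a differentiable sampler such as bilinear interpolation so that gradients can reach the flow predictor, but the caption does not specify this.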