DiffCalib: Reformulating Monocular Camera Calibration as Diffusion-Based Dense Incident Map Generation

Xiankang He, Guangkai Xu, Bo Zhang, Hao Chen, Ying Cui, Dongyan Guo

TL;DR

DiffCalib tackles monocular camera calibration from a single image by reframing intrinsic estimation as diffusion-based dense incident-map generation that leverages pre-trained diffusion priors. It jointly learns an incident map and a depth map; the camera intrinsics are then recovered from the incident map with RANSAC, which supports accurate 3D reconstruction in the wild. The method achieves state-of-the-art calibration accuracy, with up to a 40% reduction in prediction errors, and improves zero-shot 3D reconstruction; ablations validate the contributions of joint learning and ensemble inference. The approach demonstrates the practical value of diffusion priors for geometric vision tasks and suggests directions for diffusion-guided 3D reasoning from monocular imagery.

Abstract

Monocular camera calibration is a key precondition for numerous 3D vision applications. Despite considerable advances, existing methods often hinge on specific assumptions, struggle to generalize across varied real-world scenarios, and are limited by insufficient training data. Recently, diffusion models trained on expansive datasets have proven capable of generating diverse, high-quality images. This success suggests that such models have strong potential to understand varied visual information. In this work, we leverage the comprehensive visual knowledge embedded in pre-trained diffusion models to enable more robust and accurate monocular camera intrinsic estimation. Specifically, we reformulate the problem of estimating the four degrees of freedom (4-DoF) of the camera intrinsic parameters as a dense incident map generation task. The map details the angle of incidence for each pixel in the RGB image, and its format aligns well with the paradigm of diffusion models. The camera intrinsics can then be derived from the incident map with a simple non-learning RANSAC algorithm during inference. Moreover, to further enhance performance, we jointly estimate a depth map to provide extra geometric information for the incident map estimation. Extensive experiments on multiple testing datasets demonstrate that our model achieves state-of-the-art performance, with up to a 40% reduction in prediction errors. The experiments also show that the precise camera intrinsics and depth maps estimated by our pipeline can greatly benefit practical applications such as 3D reconstruction from a single in-the-wild image.
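
To make the link between an incident map and the 4-DoF intrinsics concrete, here is a minimal sketch. It assumes a pinhole camera with zero skew and an incident vector per pixel of the form ((u - cx) / fx, (v - cy) / fy, 1); this parameterization, the function names, and the RANSAC settings are illustrative assumptions, not the authors' implementation. Under that assumption each axis is a line in (pixel coordinate, incident component) space, so two pixels are enough to hypothesize a focal length and principal point, and RANSAC picks the hypothesis with the most agreeing pixels.

```python
import random


def incident_map(fx, fy, cx, cy, width, height):
    """Per-pixel incident components (x, y); z is fixed to 1 (assumed pinhole model)."""
    return [[((u - cx) / fx, (v - cy) / fy) for u in range(width)]
            for v in range(height)]


def ransac_axis(coords, values, iters=200, tol=1e-3, seed=0):
    """Fit value = (coord - c) / f with RANSAC and return (f, c).

    Returns (None, None) if no valid two-point sample was drawn.
    """
    rng = random.Random(seed)
    best, best_inliers = (None, None), -1
    for _ in range(iters):
        i, j = rng.sample(range(len(coords)), 2)
        dv = values[i] - values[j]
        dc = coords[i] - coords[j]
        if abs(dv) < 1e-12 or dc == 0:
            continue  # degenerate pair: cannot determine f
        f = dc / dv
        c = coords[i] - f * values[i]
        # Count pixels whose incident component agrees with this hypothesis.
        inliers = sum(1 for p, q in zip(coords, values)
                      if abs((p - c) / f - q) < tol)
        if inliers > best_inliers:
            best_inliers, best = inliers, (f, c)
    return best


def recover_intrinsics(inc, **ransac_kwargs):
    """Recover (fx, fy, cx, cy) from a dense incident map, one axis at a time."""
    us, xs, vs, ys = [], [], [], []
    for v, row in enumerate(inc):
        for u, (x, y) in enumerate(row):
            us.append(u)
            xs.append(x)
            vs.append(v)
            ys.append(y)
    fx, cx = ransac_axis(us, xs, **ransac_kwargs)
    fy, cy = ransac_axis(vs, ys, **ransac_kwargs)
    return fx, fy, cx, cy
```

Calling `recover_intrinsics(incident_map(40.0, 38.0, 15.5, 11.5, 32, 24))` should return values close to the intrinsics used to build the map. Treating the two axes independently keeps each minimal sample at two pixels, which is one reasonable design choice when the map contains outlier predictions.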

Paper Structure

This paper contains 41 sections, 9 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Overview of the generation pipeline. Given an image $x$, we generate the incident map $\hat{v}$ and depth map $\hat{d}$ with the denoising U-Net, starting from two randomly sampled Gaussian noises $\hat{Z}_T^v$ and $\hat{Z}_T^d$. The generated $\hat{v}$ and $\hat{d}$ are projected into 3D space to recover the 3D scene shape. The denoising process in the green part loops $T$ times. See § C for more details.
  • Figure 2: Overview of the training pipeline. We freeze the latent encoder and encode the input image $\mathbf{x}$, incident map $\mathbf{v}$, and depth map $\mathbf{d}$ into the latent space. The U-Net is then trained to predict the noise added to the depth and incident map latent codes, denoted as $\hat{\epsilon}_t^d$ and $\hat{\epsilon}_t^v$, respectively. The loss is computed between the estimated noise and the ground-truth noise that was added.
  • Figure 3: Visualization of the estimated and ground-truth (GT) incident maps on the outdoor dataset Waymo [sun2020scalability].
  • Figure 4: Qualitative comparison of 3D reconstruction. We compare with LeReS [yin2020learning] across diverse scenes.
  • Figure 5: Visualization of zooming the camera on in-the-wild scenes. Images are shown with increasing camera focal length from left to right, with our predicted focal length displayed in the bottom right corner. The estimated focal length increases markedly as the camera zooms in.
  • ...and 2 more figures
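
The Figure 1 caption mentions projecting the generated incident map and depth map into 3D space. A minimal sketch of that step follows, assuming the incident map stores per-pixel components (x, y) of a ray with z fixed to 1 and that the depth map holds z-depth in a consistent scale; both assumptions and the function name `backproject` are illustrative and are not taken from the paper.

```python
def backproject(inc, depth):
    """Lift each pixel to a 3D point by scaling its ray (x, y, 1) by depth.

    inc:   rows of (x, y) incident components, one per pixel (assumed format)
    depth: rows of z-depth values with the same layout as inc
    """
    points = []
    for inc_row, depth_row in zip(inc, depth):
        for (x, y), d in zip(inc_row, depth_row):
            points.append((x * d, y * d, d))
    return points
```

If the depth prediction is only defined up to scale and shift, it would need to be aligned before this step; the sketch leaves that out.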