DiffCalib: Reformulating Monocular Camera Calibration as Diffusion-Based Dense Incident Map Generation
Xiankang He, Guangkai Xu, Bo Zhang, Hao Chen, Ying Cui, Dongyan Guo
TL;DR
DiffCalib tackles monocular camera calibration from a single image by reframing intrinsic estimation as diffusion-based dense incident-map generation, leveraging pre-trained diffusion priors. It jointly learns an incident map and a depth map; the camera intrinsics are then recovered from the incident map with RANSAC, which in turn supports accurate 3D reconstruction from in-the-wild images. The method achieves state-of-the-art calibration accuracy and improves zero-shot 3D reconstruction, and ablations validate the contributions of joint learning and ensemble inference. The results demonstrate the practical value of diffusion priors for geometric vision tasks and suggest directions for diffusion-guided 3D reasoning from monocular imagery.
Abstract
Monocular camera calibration is a key precondition for numerous 3D vision applications. Despite considerable advancements, existing methods often hinge on specific assumptions and struggle to generalize across varied real-world scenarios, and their performance is limited by insufficient training data. Recently, diffusion models trained on expansive datasets have demonstrated the ability to generate diverse, high-quality images. This success suggests that such models have a strong capacity to understand varied visual information. In this work, we leverage the comprehensive visual knowledge embedded in pre-trained diffusion models to enable more robust and accurate monocular camera intrinsic estimation. Specifically, we reformulate the problem of estimating the four degrees of freedom (4-DoF) of the camera intrinsic parameters as a dense incident map generation task. The map encodes the angle of incidence for each pixel in the RGB image, and its format aligns well with the paradigm of diffusion models. The camera intrinsics can then be derived from the incident map with a simple non-learning RANSAC algorithm during inference. Moreover, to further enhance performance, we jointly estimate a depth map to provide extra geometric information for the incident map estimation. Extensive experiments on multiple testing datasets demonstrate that our model achieves state-of-the-art performance, with up to a 40% reduction in prediction errors. In addition, the experiments show that the accurate camera intrinsics and depth maps estimated by our pipeline can greatly benefit practical applications such as 3D reconstruction from a single in-the-wild image.
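To make the relationship between an incident map and the 4-DoF intrinsics concrete, the following is a minimal sketch, not the authors' implementation. It assumes a pinhole model in which each pixel (u, v) stores the direction K^{-1}[u, v, 1]^T, so that u = fx * r_x + cx and v = fy * r_y + cy hold per pixel and each axis can be fit independently from two-pixel samples. The function names (`incident_map`, `ransac_axis`, `recover_intrinsics`), the iteration count, the inlier tolerance, and the synthetic test values are all illustrative assumptions; the paper's exact parameterization and RANSAC settings may differ.

```python
import numpy as np

def incident_map(h, w, fx, fy, cx, cy):
    """Per-pixel incident directions K^{-1} [u, v, 1]^T, shape (h, w, 3).
    Assumed pinhole parameterization, for illustration only."""
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    return np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)

def ransac_axis(coords, rays, iters=200, tol=0.5, rng=None):
    """Robustly fit coords = f * rays + c; return (f, c).
    Each hypothesis comes from two random pixels; tol is in pixels."""
    rng = np.random.default_rng(0) if rng is None else rng
    best, best_count = (None, None), -1
    for _ in range(iters):
        i, j = rng.choice(coords.size, size=2, replace=False)
        dr = rays[i] - rays[j]
        if abs(dr) < 1e-9:  # degenerate pair: same ray component
            continue
        f = (coords[i] - coords[j]) / dr
        c = coords[i] - f * rays[i]
        inliers = np.abs(f * rays + c - coords) < tol
        count = int(inliers.sum())
        if count > best_count:
            # Refine the hypothesis with least squares on its inliers.
            A = np.stack([rays[inliers], np.ones(count)], axis=1)
            f_ls, c_ls = np.linalg.lstsq(A, coords[inliers], rcond=None)[0]
            best, best_count = (float(f_ls), float(c_ls)), count
    return best

def recover_intrinsics(inc, **kw):
    """Recover (fx, fy, cx, cy) from an (h, w, 3) incident map."""
    h, w, _ = inc.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    rays = inc / inc[..., 2:3]  # normalize so the z-component is 1
    fx, cx = ransac_axis(u.ravel(), rays[..., 0].ravel(), **kw)
    fy, cy = ransac_axis(v.ravel(), rays[..., 1].ravel(), **kw)
    return fx, fy, cx, cy

# Demo on a synthetic map with roughly 10% corrupted pixels (made-up values).
rng = np.random.default_rng(0)
gt = incident_map(48, 64, fx=500.0, fy=480.0, cx=31.5, cy=23.5)
noisy = gt.copy()
mask = rng.random(gt.shape[:2]) < 0.1
noisy[mask, :2] = rng.uniform(-1.0, 1.0, size=(int(mask.sum()), 2))
est = recover_intrinsics(noisy)
```

In this sketch the two axes are solved separately because, under the assumed parameterization, the x-component of the direction depends only on fx and cx, and the y-component only on fy and cy. A predicted incident map from a network would take the place of `noisy` here.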
