AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion
Liuyue Xie, Jiancong Guo, Ozan Cakmakci, Andre Araujo, Laszlo A. Jeni, Zhiheng Jia
TL;DR
AlignDiff tackles the challenge of calibrating cameras in unconstrained, real-world settings by jointly estimating intrinsic and extrinsic parameters using a unified ray-camera model. It introduces a diffusion-based framework conditioned on geometric priors, line embeddings, and edge-aware guidance, coupled with DistortionNet to profile and undistort aberrations. The approach is grounded in authentic lens data (≈3000 designs) to improve generalization to diverse distortions and real-world sequences. Empirical results show substantial reductions in ray angular error and improved pose accuracy on both aberrated and rectified data, demonstrating strong zero-shot and out-of-distribution performance with practical impact for in-the-wild 3D perception applications.
Abstract
Accurate camera calibration is a fundamental task for 3D perception, especially when dealing with real-world, in-the-wild environments where complex optical distortions are common. Existing methods often rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility. In this work, we introduce a novel framework that addresses these challenges by jointly modeling camera intrinsic and extrinsic parameters using a generic ray camera model. Unlike previous approaches, our method shifts focus from semantic to geometric features, enabling more accurate modeling of local distortions. We propose AlignDiff, a diffusion model conditioned on geometric priors, enabling the simultaneous estimation of camera distortions and scene geometry. To enhance distortion prediction, we incorporate edge-aware attention, which focuses the model on geometric features around image edges rather than on semantic content. Furthermore, to enhance generalizability to real-world captures, we incorporate a large database of ray-traced lenses containing over three thousand samples. This database characterizes the distortion inherent in a wide variety of lens forms. Our experiments demonstrate that the proposed method reduces the angular error of estimated ray bundles by ~8.2 degrees and improves overall calibration accuracy, outperforming existing approaches on challenging, real-world datasets.
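To make the headline metric concrete, the following is a minimal sketch of how an angular error between two ray bundles is commonly computed, assuming each bundle is an array of per-pixel 3D direction vectors. The function name `mean_ray_angular_error_deg` and the choice of a simple mean over all rays are illustrative assumptions, not the evaluation protocol confirmed by the paper.

```python
import numpy as np

def mean_ray_angular_error_deg(pred, gt, eps=1e-12):
    """Mean angle in degrees between predicted and reference ray directions.

    pred, gt: arrays of shape (..., 3) holding one direction vector per ray.
    Hypothetical helper for illustration; not taken from the paper.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Normalize each ray to unit length so the dot product is a cosine.
    pred_u = pred / np.maximum(np.linalg.norm(pred, axis=-1, keepdims=True), eps)
    gt_u = gt / np.maximum(np.linalg.norm(gt, axis=-1, keepdims=True), eps)
    # Clip to guard against rounding slightly outside [-1, 1].
    cos = np.clip(np.sum(pred_u * gt_u, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

if __name__ == "__main__":
    gt = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
    pred = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 0.0]])
    # First ray is parallel to its reference, second is perpendicular.
    print(mean_ray_angular_error_deg(pred, gt))
```

Working with direction vectors rather than parametric intrinsics keeps the metric independent of any particular distortion model, which matches the generic ray camera formulation described above.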
