AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion

Liuyue Xie, Jiancong Guo, Ozan Cakmakci, Andre Araujo, Laszlo A. Jeni, Zhiheng Jia

TL;DR

AlignDiff tackles the challenge of calibrating cameras in unconstrained, real-world settings by jointly estimating intrinsic and extrinsic parameters using a unified ray-camera model. It introduces a diffusion-based framework conditioned on geometric priors, line embeddings, and edge-aware guidance, coupled with DistortionNet to profile and undistort aberrations. The approach is grounded in authentic lens data (≈3000 designs) to improve generalization to diverse distortions and real-world sequences. Empirical results show substantial reductions in ray angular error and improved pose accuracy on both aberrated and rectified data, demonstrating strong zero-shot and out-of-distribution performance with practical impact for in-the-wild 3D perception applications.

Abstract

Accurate camera calibration is a fundamental task for 3D perception, especially in real-world, in-the-wild environments where complex optical distortions are common. Existing methods often rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility. In this work, we introduce a novel framework that addresses these challenges by jointly modeling camera intrinsic and extrinsic parameters using a generic ray camera model. We propose AlignDiff, a diffusion model conditioned on geometric priors, which enables the simultaneous estimation of camera distortions and scene geometry. Unlike previous approaches, AlignDiff shifts focus from semantic to geometric features, enabling more accurate modeling of local distortions. To enhance distortion prediction, we incorporate edge-aware attention, which focuses the model on geometric features around image edges rather than on semantic content. Furthermore, to enhance generalizability to real-world captures, we incorporate a large database of ray-traced lenses containing over three thousand samples. This database characterizes the distortion inherent in a diverse variety of lens forms. Our experiments demonstrate that the proposed method reduces the angular error of estimated ray bundles by ~8.2 degrees and improves overall calibration accuracy, outperforming existing approaches on challenging, real-world datasets.
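The headline metric in the abstract is the angular error between estimated and reference ray bundles. As a rough illustration of what such a metric measures, the following minimal sketch computes the mean angle between paired 3D ray directions. The function names and the plain-tuple ray representation are assumptions made for this example; the paper's own evaluation code may differ.

```python
import math

def angular_error_deg(pred, gt):
    """Angle in degrees between two non-zero 3D ray directions.

    The rays need not be unit length; both are normalized implicitly.
    """
    dot = sum(p * g for p, g in zip(pred, gt))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_g = math.sqrt(sum(g * g for g in gt))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_g)))
    return math.degrees(math.acos(cos_angle))

def mean_angular_error_deg(pred_rays, gt_rays):
    """Mean per-ray angular error (degrees) over two equally sized bundles."""
    errors = [angular_error_deg(p, g) for p, g in zip(pred_rays, gt_rays)]
    return sum(errors) / len(errors)

if __name__ == "__main__":
    # Hypothetical two-ray bundles: one ray matches, one is off by 90 degrees.
    predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 2.0), (0.0, 1.0, 0.0)]
    print(mean_angular_error_deg(predicted, reference))
```

Under this definition, a reduction of ~8.2 degrees would mean that, averaged over the rays, predicted directions lie about 8.2 degrees closer to the reference directions than those of the baseline being compared.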

Paper Structure

This paper contains 18 sections, 10 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: AlignDiff addresses common geometric image aberrations with a unified ray camera representation while jointly recovering the camera extrinsics. Because it is grounded in physical camera lens designs and separates geometric cues from semantic features, it generalizes to real video sequences.
  • Figure 2: AlignDiff Architecture. We promote learning camera ray profiles in three main steps: geometric cue conditioning from line features, edge-aware attention, and physical camera groundings.
  • Figure 3: The latents are reweighted through edge attention, which applies a learned mask to the Query feature to promote information along edges, where the geometric cues are more prominent. Multi-view attention then captures features across the different views.
  • Figure 4: Recovered aberration profile and undistorted images. From denoised rays in world space, the DistortNet estimates the aberration pattern represented in warping flow. The undistorted images maintain coherent structural distribution compared to aberration-free images.
  • Figure 5: Ray-traced lens designs are utilized, enabling the accurate simulation of geometric aberrations. The aberrations are encoded as local geometric distortion (ISO 17850), expressed as the percentage deviation of pixels from their ideal positions on a regular grid.
  • ...and 7 more figures
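
The Figure 5 caption describes aberrations as a percentage deviation of pixels from their ideal positions on a regular grid. A minimal sketch of that idea follows, assuming the deviation is measured radially from the image center and normalized by the ideal radius. This normalization, the function name, and the toy one-coefficient radial model are assumptions made for illustration; they are not taken from the paper or quoted from the ISO 17850 text.

```python
import math

def local_distortion_percent(ideal, actual, center=(0.0, 0.0)):
    """Signed radial deviation of `actual` from `ideal`, in percent.

    Assumption: deviation is normalized by the ideal distance from `center`.
    Negative values indicate barrel-like distortion, positive pincushion-like.
    """
    r_ideal = math.hypot(ideal[0] - center[0], ideal[1] - center[1])
    r_actual = math.hypot(actual[0] - center[0], actual[1] - center[1])
    if r_ideal == 0.0:
        return 0.0  # the center point has no defined radial deviation
    return 100.0 * (r_actual - r_ideal) / r_ideal

def toy_radial_warp(point, k1):
    """Hypothetical one-coefficient radial model: p' = p * (1 + k1 * r^2)."""
    r2 = point[0] ** 2 + point[1] ** 2
    scale = 1.0 + k1 * r2
    return (point[0] * scale, point[1] * scale)

if __name__ == "__main__":
    # Regular 3x3 grid in normalized coordinates, warped with k1 = -0.1.
    grid = [(x / 2.0, y / 2.0) for y in (-2, 0, 2) for x in (-2, 0, 2)]
    for p in grid:
        q = toy_radial_warp(p, k1=-0.1)
        print(p, round(local_distortion_percent(p, q), 3))
```

In this toy setup the deviation grows with distance from the center, which is the general pattern a lens distortion profile on a grid is meant to capture.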