Robust Multi-view Camera Calibration from Dense Matches

Johannes Hägerlind, Bao-Long Tran, Urs Waldmann, Per-Erik Forssén

TL;DR

This work tackles robust self-calibration for multi-view camera rigs using dense image correspondences, addressing both intrinsics and extrinsics in challenging distortion scenarios. It introduces a dense correspondence sampling pipeline based on RoMa, hierarchical cycle sampling, and a triangulation-angle scoring function, coupled with incremental and global SfM pipelines. The approach is validated on regular and fisheye datasets, achieving state-of-the-art or competitive AUC metrics without training and demonstrating strong performance when initialized with VGGT. The method offers a practical, interpretable alternative to black-box SfM solutions with clear applicability to real-world field deployments in animal behavior analysis and forensic reconstruction.

Abstract

Estimating camera intrinsics and extrinsics is a fundamental problem in computer vision, and while advances in structure-from-motion (SfM) have improved accuracy and robustness, open challenges remain. In this paper, we introduce a robust method for pose estimation and calibration. We consider a set of rigid cameras, each observing the scene from a different perspective, which is a typical camera setup in animal behavior studies and forensic analysis of surveillance footage. Specifically, we analyse the individual components in an SfM pipeline and identify design choices that improve accuracy. Our main contributions are: (1) we investigate how best to subsample the predicted correspondences from a dense matcher to leverage them in the estimation process, and (2) we investigate selection criteria for adding views incrementally. In a rigorous quantitative evaluation, we show the effectiveness of our changes, especially for cameras with strong radial distortion (79.9% ours vs. 40.4% vanilla VGGT). Finally, we demonstrate our correspondence subsampling in a global SfM setting where we initialize the poses using VGGT. The proposed pipeline generalizes across a wide range of camera setups, and could thus become a useful tool for animal behavior and forensic analysis.
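The TL;DR mentions a triangulation-angle scoring function but this page does not give its form. As a minimal sketch of the underlying quantity only, the snippet below computes the angle between the two viewing rays that meet at a triangulated 3D point. The function name `triangulation_angle_deg` and its arguments are hypothetical and not taken from the paper; how the paper turns such angles into a score is not shown here.

```python
import math


def triangulation_angle_deg(center_a, center_b, point):
    """Angle in degrees between the rays from two camera centres to a 3D point.

    Illustrative helper (hypothetical name, not the paper's scoring function).
    All arguments are 3-element sequences in the same world coordinate frame.
    """
    # Viewing rays from each camera centre to the 3D point.
    ray_a = [p - c for p, c in zip(point, center_a)]
    ray_b = [p - c for p, c in zip(point, center_b)]

    norm_a = math.sqrt(sum(a * a for a in ray_a))
    norm_b = math.sqrt(sum(b * b for b in ray_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("point coincides with a camera centre")

    # Clamp the cosine to [-1, 1] to guard against rounding error.
    dot = sum(a * b for a, b in zip(ray_a, ray_b))
    cos_angle = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_angle))


# Two cameras 2 units apart: a nearby point gives a wide angle,
# a distant point gives a narrow one.
near = triangulation_angle_deg((-1, 0, 0), (1, 0, 0), (0, 0, 1))
far = triangulation_angle_deg((-1, 0, 0), (1, 0, 0), (0, 0, 1000))
```

A small angle means the two rays are nearly parallel, so the depth of the point is poorly constrained. This is why SfM pipelines commonly prefer view pairs and points with larger triangulation angles.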

Paper Structure

This paper contains 30 sections, 3 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: Sample forensic scenario where four mobile cameras capture a (simulated) crime scene. Top: Sparse reconstruction and camera calibration using our method. Bottom: Input images (photos by Henry Fröcklin).
  • Figure 2: Sample animal behaviour scenario from the PFERD dataset li2024poses_pferd. Top: Point cloud reconstruction obtained by first applying the proposed method and then performing bundle adjustment over all matches while keeping the estimated camera intrinsics and extrinsics fixed. Bottom: Input images. Note that the point-cloud contrast has been adjusted, and that people are removed for privacy reasons.
  • Figure 3: Example result on the Eyeful Tower dataset VRNeRF. Top: Triangulated 3D structure and the 10 estimated cameras. Cameras are drawn with red rays emanating from the projection centre onto a sphere representing the normalized image plane (in green). Bottom: Input images.
  • Figure 4: Example images from failure cases on the MVS dataset jensen2014large in the incremental pipeline. These scenes contain shadows that vary from frame to frame and a completely white table surface, which likely contribute to the observed reconstruction failures. In (a), the flat checkerboard also constitutes a dominant scene plane, a known challenging case in SfM.