SOFI: Multi-Scale Deformable Transformer for Camera Calibration with Enhanced Line Queries

Sebastian Janampa, Marios Pattichis

TL;DR

SOFI tackles single-image camera calibration by introducing a multi-scale deformable transformer with enhanced line queries that fuse line content with geometric information. The method enables cross-scale interaction through MSDeformAttn and a redesigned line-query scheme, leading to improved estimation of the zenith vanishing point, horizon line, and field of view, as well as robust line segmentation. Key contributions include a line-queries module with separate content and position components, a revised line-classification and confidence mechanism, and a loss formulation that emphasizes camera-parameter accuracy. Empirically, SOFI achieves state-of-the-art or competitive results on Google Street View, Horizon Line in the Wild, and Holicity while maintaining fast inference, demonstrating robust calibration in diverse and out-of-distribution scenes. This work advances end-to-end transformer-based camera calibration by enabling effective cross-scale feature interaction and richer line representations.

Abstract

Camera calibration consists of estimating camera parameters such as the zenith vanishing point and horizon line. Estimating the camera parameters enables other tasks such as 3D rendering, artificial reality effects, and object insertion in an image. Transformer-based models have provided promising results; however, they lack cross-scale interaction. In this work, we introduce SOFI, a multi-Scale defOrmable transFormer for camera calibratIon with enhanced line queries. SOFI improves the line queries used in CTRL-C and MSCC by using both line content and line geometric features. Moreover, SOFI's line queries allow transformer models to adopt the multi-scale deformable attention mechanism to promote cross-scale interaction between the feature maps produced by the backbone. SOFI outperforms existing methods on the Google Street View, Horizon Line in the Wild, and Holicity datasets while keeping a competitive inference speed.
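To make the quantities named in the abstract concrete, the following is a minimal sketch of how the zenith vanishing point, horizon line, and field of view relate under a standard pinhole camera model. It is not SOFI's method, which predicts these quantities with a learned network; it only illustrates the geometry. The function names, the sign conventions for pitch and roll, and the choice of image coordinates (origin at the principal point, x to the right, y downward) are assumptions made for this illustration.

```python
import math


def focal_from_vfov(image_height, vfov_rad):
    """Focal length in pixels for a pinhole camera with the given vertical FoV."""
    return image_height / (2.0 * math.tan(vfov_rad / 2.0))


def up_direction(pitch_rad, roll_rad):
    """World 'up' unit vector in camera coordinates (x right, y down, z forward).

    Assumed convention: positive pitch tilts the camera upward; roll rotates
    the camera about its optical axis.
    """
    return (
        math.cos(pitch_rad) * math.sin(roll_rad),
        -math.cos(pitch_rad) * math.cos(roll_rad),
        math.sin(pitch_rad),
    )


def zenith_vanishing_point(focal, up):
    """Image point where the 'up' direction projects (principal point at origin)."""
    ux, uy, uz = up
    if abs(uz) < 1e-12:
        raise ValueError("zenith vanishing point is at infinity for zero pitch")
    return (focal * ux / uz, focal * uy / uz)


def horizon_line(focal, up):
    """Coefficients (a, b, c) of the horizon line a*x + b*y + c = 0.

    A pixel (x, y) lies on the horizon when its viewing ray (x, y, focal)
    is perpendicular to the 'up' direction.
    """
    ux, uy, uz = up
    return (ux, uy, focal * uz)


# Example: a 480-pixel-high image, 60 degree vertical FoV, camera pitched up 10 degrees.
focal = focal_from_vfov(480, math.radians(60))
up = up_direction(math.radians(10), 0.0)
a, b, c = horizon_line(focal, up)
zenith_x, zenith_y = zenith_vanishing_point(focal, up)
horizon_y = -c / b  # with zero roll the horizon is the row y = horizon_y
```

With zero roll, the horizon row and the zenith vanishing point fall on opposite sides of the principal point, and the product of their y-coordinates equals minus the squared focal length. This coupling is why the three quantities are typically estimated jointly.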

Paper Structure (27 sections, 13 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Model overview. Our proposed network uses ResNet50 as the backbone to extract feature maps from stages 2 and 3, which are then fed to the deformable transformer encoder.
  • Figure 2: Examples of horizon line estimation on the Google Street View test set (top row), the Horizon Line in the Wild test set (middle row), and the Holicity test set (bottom row).
  • Figure 3: Cumulative error distribution for the horizon line on the Horizon Line in the Wild and Holicity test sets.
  • Figure 4: Decoder module. Query definition in different transformer-based models for camera calibration.