From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera Calibration

Zekun Qian, Ruize Han, Wei Feng, Feifan Wang, Song Wang

TL;DR

This work addresses joint camera and subject registration in a bird's eye view (BEV) without explicit camera calibration. It introduces an end-to-end framework that alternates between BEV-based subject localization (via VTM) and geometric camera pose estimation (via SAM), followed by a registration and fusion step driven by geometry and appearance and augmented by self-supervised subject association. The authors build a large synthetic dataset, CSRD, and report strong cross-view and cross-domain performance, including a real-world evaluation; ablations confirm the contributions of the pretrained components and of orientation supervision. In many practical scenarios the approach removes the need for BEV inputs or calibration data, enabling multi-view human localization and camera localization in a unified BEV.

Abstract

We tackle a new problem of multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration. This is a very challenging problem: the only input is several RGB images of a multi-person scene taken from different first-person views (FPVs), with neither a BEV image nor calibration of the FPVs, while the output is a unified plane giving the location and orientation of both the subjects and the cameras in a BEV. We propose an end-to-end framework to solve this problem, whose main idea can be divided into the following parts: i) creating a view-transform subject detection module that transforms each FPV into a virtual BEV containing the location and orientation of each pedestrian, ii) deriving a method based on geometric transformation to estimate camera location and view direction, i.e., the camera registration in a unified BEV, and iii) using spatial and appearance information to aggregate the subjects into the unified BEV. We collect a new large-scale synthetic dataset with rich annotations for evaluation. The experimental results show the remarkable effectiveness of our proposed method.
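
The geometric idea in part ii) can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes that each per-view BEV is camera-centred (camera at the origin, looking along +y), that a single subject has already been matched across two views, and that its position and facing angle are known in both BEVs. The function names `align_bev` and `camera_b_pose_in_a` are hypothetical.

```python
import math


def align_bev(subj_a, subj_b):
    """Rigid 2D transform that maps BEV-B coordinates into BEV-A.

    Each subject is (x, y, theta): position and facing angle in radians,
    expressed in that camera's own BEV frame (an assumption of this sketch).
    Returns (dtheta, tx, ty) such that p_a = R(dtheta) @ p_b + (tx, ty).
    """
    xa, ya, tha = subj_a
    xb, yb, thb = subj_b
    # The rotation between frames is the difference of the facing angles,
    # wrapped to [-pi, pi).
    dtheta = (tha - thb + math.pi) % (2 * math.pi) - math.pi
    c, s = math.cos(dtheta), math.sin(dtheta)
    # The translation makes the rotated B position coincide with the A position.
    tx = xa - (c * xb - s * yb)
    ty = ya - (s * xb + c * yb)
    return dtheta, tx, ty


def camera_b_pose_in_a(subj_a, subj_b, view_angle=math.pi / 2):
    """Location and view direction of camera B in camera A's BEV.

    Camera B sits at the origin of its own BEV, so its location in BEV-A
    is just the translation; its view direction is rotated by dtheta.
    """
    dtheta, tx, ty = align_bev(subj_a, subj_b)
    return tx, ty, (view_angle + dtheta) % (2 * math.pi)


if __name__ == "__main__":
    # One matched subject as seen in the two camera-centred BEVs.
    subject_in_a = (1.0, 3.0, 0.5)
    subject_in_b = (-3.0, 3.0, 0.5 - math.pi / 2)
    print(camera_b_pose_in_a(subject_in_a, subject_in_b))
```

With several matched subjects, each pair yields its own candidate pose, so some form of aggregation or outlier rejection would be needed; the sketch leaves that out.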

Paper Structure (24 sections, 7 equations, 15 figures, 7 tables, 1 algorithm)

Figures (15)

  • Figure 1: An illustration of the multi-view camera and subject registration problem.
  • Figure 2: Framework of the proposed method, which can be divided into three parts, i.e., VTM, SAM and Registration. We use hollow camera icons to represent registered cameras and filled camera icons to represent unregistered cameras.
  • Figure 3: The structure of LocoNet. Here $\times 512$ denotes 512 feature channels, Fc a fully connected layer, BN a batch normalization layer, ReLU the activation function, and DP0.2 a dropout layer with ratio $0.2$.
  • Figure 4: An illustration of the rotation and translation transformation for two BEVs with a matching pair.
  • Figure 5: Candidate camera poses in the coordinate system.
  • ...and 10 more figures