SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments
Niklas Gard, Anna Hilsmann, Peter Eisert
TL;DR
SPVLoc tackles indoor 6D camera localization in unseen environments by cross-domain matching of a perspective RGB image against semantically annotated panoramas rendered from a minimalist 3D model. A RetinaNet-style viewport predictor, depth-wise feature correlation, and a pose head enable end-to-end estimation of the 6D pose relative to the best-matching panorama, with optional refinement via re-rendered views. The method relies on synthetic semantic panoramas to bridge the domain gap and uses a multi-task loss with learned uncertainty to balance the viewport and pose objectives. On the Structured3D and Zillow Indoor datasets, SPVLoc achieves higher localization accuracy than state-of-the-art methods while using only sparsely sampled panoramas, generalizes to unseen scenes, and is efficient enough for large indoor environments and potential AR applications.
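As an illustration of how a multi-task loss with learned uncertainty can balance two objectives, the sketch below uses the common log-variance weighting, total = sum_i exp(-s_i) * L_i + s_i. This form is an assumption: the summary does not state the exact formulation used by SPVLoc, and the function name and arguments are hypothetical.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses with one learned log-variance s_i per task.

    Assumed form (not confirmed by the source): sum_i exp(-s_i) * L_i + s_i.
    A large s_i down-weights task i, while the additive s_i term penalizes
    inflating the uncertainty without bound.
    """
    if len(task_losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per task loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))


# Hypothetical values: a viewport loss of 2.0 and a pose loss of 3.0.
equal_weighting = uncertainty_weighted_loss([2.0, 3.0], [0.0, 0.0])
viewport_downweighted = uncertainty_weighted_loss([2.0, 3.0], [math.log(2.0), 0.0])
```

With all log-variances at zero the result is the plain sum of the task losses. In training, the log-variances would be optimized jointly with the network weights.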
Abstract
In this paper, we present SPVLoc, a global indoor localization method that accurately determines the six-dimensional (6D) camera pose of a query image while requiring minimal scene-specific prior knowledge and no scene-specific training. Our approach employs a novel matching procedure to localize the perspective camera's viewport, given as an RGB image, within a set of panoramic semantic layout representations of the indoor environment. The panoramas are rendered from an untextured 3D reference model that comprises only approximate structural information about room shapes, along with door and window annotations. We demonstrate that a straightforward convolutional network structure can successfully achieve image-to-panorama and, ultimately, image-to-model matching. Using a viewport classification score, we rank the reference panoramas and select the best match for the query image. A 6D relative pose is then estimated between the chosen panorama and the query image. Our experiments demonstrate that this approach not only efficiently bridges the domain gap but also generalizes well to previously unseen scenes that are not part of the training data. Moreover, it achieves superior localization accuracy compared to state-of-the-art methods while also estimating more degrees of freedom of the camera pose. Our source code is publicly available at https://fraunhoferhhi.github.io/spvloc.
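The rank-then-estimate procedure described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the score values, and the frame conventions are all assumptions. In particular, it assumes each panorama is rendered at a known position with its axes aligned to the model frame, so the relative translation can simply be added to the panorama position.

```python
def rank_panoramas(scores):
    """Return panorama indices ordered from highest to lowest viewport score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def compose_absolute_pose(pano_position, relative_rotation, relative_translation):
    """Turn a pose relative to a panorama into a pose in the model frame.

    Assumption: the panorama frame is axis-aligned with the model frame, so
    the rotation carries over unchanged and only the translation is offset.
    """
    position = [p + t for p, t in zip(pano_position, relative_translation)]
    return relative_rotation, position


# Hypothetical viewport classification scores for three reference panoramas.
scores = [0.10, 0.90, 0.40]
pano_positions = [[0.0, 0.0, 1.5], [1.0, 2.0, 1.5], [4.0, 2.0, 1.5]]

ranking = rank_panoramas(scores)
best = ranking[0]

# Placeholder for the output of a pose estimator (identity rotation here).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rotation, position = compose_absolute_pose(
    pano_positions[best], identity, [0.5, -0.25, 0.0]
)
```

Keeping the full ranking rather than only the top index leaves room to fall back to the next-best panorama if the first pose estimate is rejected.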
