HyperPose: Hypernetwork-Infused Camera Pose Localization and an Extended Cambridge Landmarks Dataset

Ron Ferens, Yosi Keller

TL;DR

HyperPose tackles the domain gap in absolute camera pose regression (APR) by embedding a hypernetwork that generates input-conditioned weights for the pose regression heads, enabling adaptive feature emphasis during inference. The approach applies to both single- and multi-scene APRs; MS-HyperPose uses DETR-inspired Transformers to process activation maps, while the hypernetwork supplies regression-head weights conditioned on the input. A new Extended Cambridge Landmarks (ECL) dataset benchmarks robustness to seasonal and lighting variations. Experiments show that HyperPose improves over state-of-the-art APRs on Cambridge Landmarks and 7Scenes, while MS-HyperPose achieves top multi-scene results with competitive latency (≈ 30.93 ms) and model size (≈ 571 MB). The contributions are a general hypernetwork-enabled APR framework, quantitative gains across diverse datasets, and the ECL benchmark, which is intended to drive development of more invariant localization methods; code and models are open source. The pose is represented as $p = \langle \mathbf{x}, \mathbf{q} \rangle$, where $\mathbf{x} \in \mathbb{R}^3$ encodes position and $\mathbf{q} \in \mathbb{R}^4$ encodes orientation.

Abstract

In this work, we propose HyperPose, which utilizes hypernetworks in absolute camera pose regressors. The inherent appearance variations in natural scenes, attributable to environmental conditions, perspective, and lighting, induce a significant domain disparity between the training and test datasets. This disparity degrades the precision of contemporary localization networks. To mitigate this, we advocate incorporating hypernetworks into single-scene and multi-scene camera pose regression models. During inference, the hypernetwork dynamically computes adaptive weights for the localization regression heads based on the particular input image, effectively narrowing the domain gap. Using indoor and outdoor datasets, we evaluate the HyperPose methodology across multiple established absolute pose regression architectures. We also introduce and share the Extended Cambridge Landmarks (ECL), a novel localization dataset based on the Cambridge Landmarks dataset, which shows the same scenes in multiple seasons with significantly varying appearance conditions. Our empirical experiments demonstrate that HyperPose yields notable performance enhancements for single- and multi-scene architectures. We have made our source code, pre-trained models, and the ECL dataset openly available.
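The core mechanism in the abstract, a hypernetwork that emits the regression-head weights for each query image, can be illustrated with a minimal pure-Python sketch. Everything named here is hypothetical and not taken from the paper: `HyperHead`, `regress_pose`, the single linear layer per head, the toy dimensions, and the final quaternion normalization (a common APR convention, assumed rather than confirmed). The vectors `hyper_feat` and `main_feat` stand in for the image embeddings that a hypernetwork backbone and a main backbone would produce.

```python
import random


def linear(weights, bias, vec):
    """Apply y = W v + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def make_linear(rng, out_dim, in_dim, scale=0.1):
    """Random (out_dim x in_dim) weight matrix with a zero bias (toy init)."""
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


class HyperHead:
    """Hypothetical regression head whose parameters are generated per input."""

    def __init__(self, rng, hyper_dim, feat_dim, out_dim):
        self.feat_dim, self.out_dim = feat_dim, out_dim
        # The hypernetwork layer outputs every weight and bias of the head.
        n_params = out_dim * feat_dim + out_dim
        self.hyper_w, self.hyper_b = make_linear(rng, n_params, hyper_dim)

    def generate(self, hyper_feat):
        """Map the hypernetwork embedding to the head's (weights, bias)."""
        flat = linear(self.hyper_w, self.hyper_b, hyper_feat)
        split = self.out_dim * self.feat_dim
        weights = [flat[r * self.feat_dim:(r + 1) * self.feat_dim]
                   for r in range(self.out_dim)]
        return weights, flat[split:]

    def __call__(self, hyper_feat, main_feat):
        weights, bias = self.generate(hyper_feat)
        return linear(weights, bias, main_feat)


def regress_pose(pos_head, ori_head, hyper_feat, main_feat):
    """Return (x, q): a 3-D position and a unit-norm 4-D orientation."""
    x = pos_head(hyper_feat, main_feat)
    q = ori_head(hyper_feat, main_feat)
    norm = sum(c * c for c in q) ** 0.5 or 1.0  # assumed normalization step
    return x, [c / norm for c in q]


rng = random.Random(0)
pos_head = HyperHead(rng, hyper_dim=4, feat_dim=6, out_dim=3)
ori_head = HyperHead(rng, hyper_dim=4, feat_dim=6, out_dim=4)
main_feat = [rng.uniform(-1, 1) for _ in range(6)]
hyper_feat = [rng.uniform(-1, 1) for _ in range(4)]
x, q = regress_pose(pos_head, ori_head, hyper_feat, main_feat)
```

In this sketch the head has no fixed parameters of its own: two images with identical `main_feat` but different `hyper_feat` are regressed with different weights, which is the adaptive behaviour the abstract describes.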

Paper Structure (11 sections, 5 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: A baseline APR architecture and its hypernetwork extension. The baseline consists of a CNN backbone that generates a feature vector encoding the input query image, followed by regression heads that estimate the translation and orientation. The extension adds a hypernetwork that outputs the weights for the regression heads (translation and orientation) during inference.
  • Figure 2: MS-Transformer with hypernetwork (MS-HyperPose): the proposed multi-scene absolute pose regression architecture using a hypernetwork. The primary network employs position and orientation Transformers, in a dual-branch architecture, to process activation maps from the underlying convolutional backbone. The hypernetwork generates the weights for the regression heads in the primary network from the input query image. These adaptive weights, combined with the latent vectors generated by the Transformers, are used to estimate the camera pose, consisting of the spatial position ($\mathbf{x}$) and angular orientation ($\mathbf{q}$).
  • Figure 3: A sample from the King's College scene in the proposed Extended Cambridge Landmarks (ECL) dataset.
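
The head computation described in the Figure 2 caption can be written compactly. The notation below is assumed for illustration and does not come from the paper: $I$ is the query image, $\mathbf{z}_x$ and $\mathbf{z}_q$ are the latent vectors from the position and orientation Transformers, and $h_\phi$ is the hypernetwork. Using a single linear layer per head is a simplification.

$$
\begin{aligned}
(\mathbf{W}_x, \mathbf{b}_x, \mathbf{W}_q, \mathbf{b}_q) &= h_\phi(I), \\
\hat{\mathbf{x}} &= \mathbf{W}_x \mathbf{z}_x + \mathbf{b}_x, \qquad \hat{\mathbf{x}} \in \mathbb{R}^3, \\
\hat{\mathbf{q}} &= \mathbf{W}_q \mathbf{z}_q + \mathbf{b}_q, \qquad \hat{\mathbf{q}} \in \mathbb{R}^4 .
\end{aligned}
$$

Under this reading, the estimated pose is $\hat{p} = \langle \hat{\mathbf{x}}, \hat{\mathbf{q}} \rangle$, and the only image-dependent parameters in the heads are those produced by $h_\phi$.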