Extreme Rotation Estimation in the Wild

Hana Bezalel, Dotan Ankri, Ruojin Cai, Hadar Averbuch-Elor

TL;DR

This work estimates the relative 3D rotation between non-overlapping real-world images using a Rotation Estimation Transformer that operates on LoFTR features and auxiliary channels. It introduces the ExtremeLandmarkPairs dataset to benchmark extreme-view rotations in the wild and presents a progressive training pipeline that starts from panorama crops and extends to real Internet data with FoV and appearance augmentations. The approach achieves state-of-the-art performance on non-overlapping wild pairs while remaining competitive on panorama-cropped overlaps, highlighting the value of real-world data and multi-modal cues for robust pose estimation. The dataset and methodology have practical implications for camera localization and large-scale 3D reconstruction in unconstrained settings, with room for further improvement through enhanced augmentations and multi-view extensions.

Abstract

We present a technique and benchmark dataset for estimating the relative 3D orientation between a pair of Internet images captured in an extreme setting, where the images have limited or non-overlapping fields of view. Prior work targeting extreme rotation estimation assumes constrained 3D environments and emulates perspective images by cropping regions from panoramic views. However, real images captured in the wild are highly diverse, exhibiting variation in both appearance and camera intrinsics. In this work, we propose a Transformer-based method for estimating relative rotations in extreme real-world settings, and contribute the ExtremeLandmarkPairs dataset, assembled from scene-level Internet photo collections. Our evaluation demonstrates that our approach succeeds in estimating the relative rotations in a wide variety of extreme-view Internet image pairs, outperforming various baselines, including dedicated rotation estimation techniques and contemporary 3D reconstruction methods.
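
To make the estimated quantity concrete, the following minimal sketch computes the relative rotation between two cameras and the geodesic angle between a predicted and a ground-truth rotation, a common way to score rotation estimates. The world-to-camera convention, the y-axis yaw helper, and all function names are assumptions for illustration; the abstract does not specify them.

```python
import math

def rot_y(deg):
    """Rotation about the y-axis by `deg` degrees, as a 3x3 nested list."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """Transpose of a 3x3 matrix (the inverse, for a rotation)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def relative_rotation(r1, r2):
    """Assuming world-to-camera matrices, R_rel = R2 * R1^T maps
    camera 1 coordinates to camera 2 coordinates."""
    return matmul(r2, transpose(r1))

def geodesic_error_deg(r_pred, r_gt):
    """Angle (degrees) of the residual rotation R_pred^T * R_gt."""
    residual = matmul(transpose(r_pred), r_gt)
    trace = residual[0][0] + residual[1][1] + residual[2][2]
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

# Two hypothetical cameras that differ by a 120-degree yaw.
r_rel = relative_rotation(rot_y(10.0), rot_y(130.0))
error_perfect = geodesic_error_deg(r_rel, rot_y(120.0))
error_identity = geodesic_error_deg(rot_y(0.0), rot_y(120.0))
```

Here `error_perfect` is numerically zero, while predicting the identity for the same pair gives an error equal to the full 120-degree yaw.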

Paper Structure

This paper contains 26 sections, 3 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Given a pair of (possibly) non-overlapping images captured in the wild (e.g., under arbitrary illumination and intrinsic camera parameters), such as the images of the Dam Square in Amsterdam depicted in red and blue boxes above*, our technique estimates the relative 3D rotation between the images. *The background panorama is illustrated for visualization purposes only.
  • Figure 2: Camera distribution of the Vatican, Rome scene from the ExtremeLandmarkPairs Dataset. We construct a dataset of real perspective image pairs with predominant rotational motion shown in (b) and (c) from the dense imagery reconstruction in (a).
  • Figure 3: Method architecture. Given a pair of input Internet images, we extract image features using pretrained LoFTR. These features are combined with auxiliary channels, including keypoint masks, pairwise match masks, and segmentation maps (visualized on the bottom left). The resulting image features are reshaped into tokens and concatenated with Euler angle position embeddings, and the combined tokens are processed by our Rotation Estimation Transformer module. The output Euler angle tokens and averaged image tokens are concatenated and processed by MLPs to predict the probability distribution of Euler angles, representing the relative 3D rotation between the input images.
  • Figure 4: Augmenting perspective images cropped from panoramic views with image-level appearance modifications. Given an input image (left) and a target text prompt "Make it $\left<w\right>$" ($\left<w\right>$ is specified above), we use a conditional diffusion model (InstructPix2Pix; Brooks et al., 2023) to create semantic appearance augmentations which modify both the global image characteristics and local image regions.
  • Figure 5: Qualitative results on the wELP test set. We visualize the results of our model over different overlap levels, where the images on the left serve as the reference points, and their coordinate system determines the relative rotation, which defines the images on the right. The ellipsoids representing the ground truth are color-coded to match their respective images, with the estimated relative rotation illustrated by a cyan dashed line. As illustrated by the examples above, our method can accurately predict relative rotations for diverse image pairs containing varying appearances and intrinsic parameters. Please refer to the supplementary material for additional qualitative results.
  • ...and 5 more figures
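
The Figure 3 caption describes the model output as probability distributions over Euler angles. As a rough sketch of how such an output could be turned into angles, the snippet below applies a softmax to per-angle logits and takes the center of the most likely bin. The 360 one-degree bins, the [-180, 180) range, and the function names are illustrative assumptions, not details confirmed by the captions above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_angle(probs, lo=-180.0, hi=180.0):
    """Return the center (degrees) of the most likely bin.

    Assumes the bins partition [lo, hi) evenly; the range and the
    bin count are hypothetical choices for this sketch.
    """
    width = (hi - lo) / len(probs)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return lo + (best + 0.5) * width

# Hypothetical logits for one Euler angle: 360 bins, peak at bin 270.
logits = [0.0] * 360
logits[270] = 5.0
probs = softmax(logits)
angle = decode_angle(probs)
```

With this binning, bin 270 covers [90, 91) degrees, so `angle` is its center. The same decoding would be applied independently to each of the three Euler angles.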