Estimating Extreme 3D Image Rotation with Transformer Cross-Attention

Shay Dekel, Yosi Keller, Martin Cadik

TL;DR

This work tackles the problem of estimating extreme 3D image rotations for image pairs with limited or no overlap. It introduces a cross-attention-based framework that replaces the traditional 4D correlation volume with Transformer-Encoder-driven cross-attention over refined CNN feature maps, augmented by cross-decoding to distill inter-image information and a cascaded decoding scheme to refine a quaternion query for rotation regression. The model is trained end-to-end with a regression loss to predict the relative rotation quaternion $\tilde{q}$, achieving state-of-the-art accuracy on InteriorNet, StreetLearn, and SUN360 across the large-, small-, and no-overlap categories, and demonstrating strong cross-dataset generalization. Attention visualizations reveal that the Transformer-Encoder focuses on rotation-informative cues such as vertical and horizontal lines, supporting the intuitive link between geometric structure and rotation inference. Overall, the approach provides a robust, generalizable alternative to 4D correlation volumes for extreme rotation estimation and can extend to related two-image tasks such as optical flow or relative pose regression.
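
As a rough illustration of what regressing a relative rotation quaternion involves, the sketch below compares a predicted and a reference quaternion. The summary only says "a regression loss", so the squared-L2 form used here is an assumption, and the function names (`normalize`, `l2_quaternion_loss`, `geodesic_error_deg`) are hypothetical. The geodesic angle is the standard way to express the gap between two rotations in degrees.

```python
import math

def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def l2_quaternion_loss(q_pred, q_true):
    """Assumed loss form: squared L2 distance between unit quaternions.

    q and -q encode the same rotation, so the smaller distance is returned.
    """
    p, t = normalize(q_pred), normalize(q_true)
    d_pos = sum((a - b) ** 2 for a, b in zip(p, t))
    d_neg = sum((a + b) ** 2 for a, b in zip(p, t))
    return min(d_pos, d_neg)

def geodesic_error_deg(q_pred, q_true):
    """Angle in degrees of the rotation that takes q_pred to q_true."""
    p, t = normalize(q_pred), normalize(q_true)
    dot = abs(sum(a * b for a, b in zip(p, t)))
    # Clamp to guard against round-off pushing the dot product above 1.
    return math.degrees(2.0 * math.acos(min(1.0, dot)))

identity = (1.0, 0.0, 0.0, 0.0)
# 90-degree rotation about the z axis: (cos 45 deg, 0, 0, sin 45 deg).
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))

error = geodesic_error_deg(identity, quarter_turn_z)
loss = l2_quaternion_loss(identity, quarter_turn_z)
```

Taking the absolute value of the dot product (and the minimum over both signs in the loss) treats `q` and `-q` as the same rotation, which avoids penalizing a prediction that is correct up to sign.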

Abstract

The estimation of large and extreme image rotation plays a key role in multiple computer vision domains, where the rotated images are related by a limited or a non-overlapping field of view. Contemporary approaches apply convolutional neural networks to compute a 4D correlation volume to estimate the relative rotation between image pairs. In this work, we propose a cross-attention-based approach that uses CNN feature maps and a Transformer-Encoder to compute the cross-attention between the activation maps of the image pairs, which is shown to be an improved equivalent of the 4D correlation volume used in previous works. In the suggested approach, higher attention scores are associated with image regions that encode visual cues of rotation. Our approach is end-to-end trainable and optimizes a simple regression loss. It is experimentally shown to outperform contemporary state-of-the-art schemes when applied to commonly used image rotation datasets and benchmarks, and establishes a new state-of-the-art accuracy on these datasets. We make our code publicly available.
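
The relation between a 4D correlation volume and cross-attention can be sketched numerically. The example below is a minimal illustration, not the authors' implementation: random arrays stand in for CNN feature maps, the sizes `H`, `W`, `C` are hypothetical, and no learned query/key/value projections are used. Under those assumptions, the flattened 4D volume of pairwise dot products coincides with the raw attention logits, and attention then adds scaling, a softmax, and a weighted aggregation of the second image's features.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 4, 5, 8                      # toy feature-map size (hypothetical)
f1 = rng.standard_normal((H, W, C))    # stand-in for CNN features of image 1
f2 = rng.standard_normal((H, W, C))    # stand-in for CNN features of image 2

# 4D correlation volume: one dot product per pair of spatial locations.
corr_4d = np.einsum("ijc,klc->ijkl", f1, f2)   # shape (H, W, H, W)

# Attention view: flatten each map into H*W tokens and form Q K^T.
tokens1 = f1.reshape(H * W, C)
tokens2 = f2.reshape(H * W, C)
logits = tokens1 @ tokens2.T                    # shape (H*W, H*W)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product cross-attention from image-1 tokens to image-2 tokens.
attn = softmax(logits / np.sqrt(C))
attended = attn @ tokens2                       # shape (H*W, C)
```

In a trained model the tokens would first pass through learned projections, so the logits are a learned similarity rather than a plain dot product of the feature maps.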

Paper Structure (15 sections, 5 equations, 4 figures, 6 tables)

This paper contains 15 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The estimation of extreme 3D image rotations. First row: image pairs with a small overlap. Second row: non-overlapping image pairs. The proposed scheme estimates the relative rotation between image pairs.
  • Figure 2: The proposed architecture utilizes weight-sharing Siamese CNNs to encode the input image pair $(I_{1},I_{2})\in \mathbb{R}^{H\times W}$ into feature maps $(\hat{I}_{1},\hat{I}_{2})$. These feature maps are then cross-decoded by the weight-sharing Transformer Decoder-0 layers, cross-distilling $(\hat{I}_{1},\hat{I}_{2})$ into the representations $\Bar{\Bar{I}}_{1}$ and $\Bar{\Bar{I}}_{2}$. The concatenated refined embeddings $T$ are input to the Transformer-Encoder alongside an attention mask $M$ to derive the cross-attention encoding $\hat{T}$. $\hat{T}$ enters a cascade of two Transformer Decoders, where the first, Transformer Decoder-1, enhances the cross-attention as $\Bar{\Bar{T}}$, guided by the learned quaternion rotation query $\bar{q}$. The second, Transformer Decoder-2, encodes the rotation as $\Bar{\Bar{q}}$, which is transformed via a multilayer perceptron (MLP) to predict the relative quaternion rotation $\Tilde{q}$.
  • Figure 3: Computing the cross-attention using a Transformer-Encoder and the input mask $\mathbf{M}$. The mask $\mathbf{M}$ zeros the self-attention terms, retaining only the cross-attention terms.
  • Figure 4: Rotation estimation results. The panoramic and cropped ground-truth images are marked by green and yellow dotted lines. The predicted footprint of one of the cropped images is marked by the red dotted line. The first row shows the matching results of images with large overlaps. The second and last rows show the matching of small-overlap and non-overlapping images.
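
The masking idea in the Figure 3 caption, where the mask removes self-attention terms so only cross-image terms remain, can be sketched as follows. This is a minimal single-head illustration under stated assumptions: the helper names `cross_attention_mask` and `masked_attention` are hypothetical, no learned projections are used, and blocking a pair by setting its logit to negative infinity before the softmax is one common way to realize such a mask, not necessarily the authors' choice.

```python
import numpy as np

def cross_attention_mask(n1, n2):
    """Boolean mask over n1 + n2 tokens; True marks a blocked same-image pair."""
    image_id = np.concatenate([np.zeros(n1, dtype=int), np.ones(n2, dtype=int)])
    return image_id[:, None] == image_id[None, :]

def masked_attention(tokens, blocked):
    """Single-head dot-product attention weights that ignore blocked pairs."""
    d = tokens.shape[-1]
    logits = tokens @ tokens.T / np.sqrt(d)
    # Blocked pairs get -inf, so they receive zero weight after the softmax.
    logits = np.where(blocked, -np.inf, logits)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n1, n2, d = 3, 4, 8                          # toy token counts and width
T = rng.standard_normal((n1 + n2, d))        # stand-in for concatenated embeddings
M = cross_attention_mask(n1, n2)
A = masked_attention(T, M)

# Diagonal blocks (image 1 to image 1, image 2 to image 2) carry no weight.
self_block_1 = A[:n1, :n1]
self_block_2 = A[n1:, n1:]
```

With this mask, each token of one image distributes all of its attention over tokens of the other image, which is why the encoder output can be read as a cross-attention encoding of the pair.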