Estimating Extreme 3D Image Rotation with Transformer Cross-Attention
Shay Dekel, Yosi Keller, Martin Cadik
TL;DR
This work tackles the problem of estimating extreme 3D image rotations for image pairs with limited or no overlap. It introduces a cross-attention-based framework that replaces the traditional $4$D correlation volume with cross-attention computed by a Transformer-Encoder over refined CNN feature maps, augmented by cross-decoding to distill inter-image information and a cascaded decoding scheme to refine a quaternion query for rotation regression. The model is trained end-to-end with a regression loss to predict the relative rotation quaternion $\tilde{q}$, achieving state-of-the-art accuracy on InteriorNet, StreetLearn, and SUN360 across the large-, small-, and no-overlap categories, and demonstrating strong cross-dataset generalization. Attention visualizations reveal that the Transformer-Encoder focuses on rotation-informative cues such as vertical and horizontal lines, supporting the intuitive link between geometric structure and rotation inference. Overall, the approach provides a robust, generalizable alternative to $4$D correlation volumes for extreme rotation estimation and can extend to related two-image tasks such as optical flow or relative pose regression.
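To make the quaternion regression target concrete, the following is a minimal sketch of one plausible regression loss and of the geodesic angle commonly used to report rotation error. The summary only states that a regression loss is used, so the L2 form below, along with the function names `normalize`, `quaternion_l2_loss`, and `geodesic_error_deg`, are illustrative assumptions rather than the authors' implementation.

```python
import math

def normalize(q):
    """Scale a 4-vector to unit length so it represents a rotation."""
    n = math.sqrt(sum(c * c for c in q))
    return [c / n for c in q]

def quaternion_l2_loss(q_pred, q_true):
    """Assumed loss: L2 distance between the normalized prediction and the target.

    q_true is expected to be a unit quaternion (w, x, y, z).
    """
    q_hat = normalize(q_pred)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q_hat, q_true)))

def geodesic_error_deg(q_pred, q_true):
    """Angle (degrees) of the rotation taking q_pred to q_true.

    The absolute value handles the q / -q sign ambiguity of unit quaternions.
    """
    q_hat, q_ref = normalize(q_pred), normalize(q_true)
    dot = abs(sum(a * b for a, b in zip(q_hat, q_ref)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))

# Identity versus a 90-degree rotation about the z-axis.
identity = [1.0, 0.0, 0.0, 0.0]
rot_z_90 = [0.5 ** 0.5, 0.0, 0.0, 0.5 ** 0.5]
error = geodesic_error_deg(identity, rot_z_90)
```

Normalizing the raw network output before comparison means the loss depends only on the predicted rotation, not on the scale of the regressed 4-vector.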
Abstract
The estimation of large and extreme image rotation plays a key role in multiple computer vision domains, where the rotated images are related by limited or non-overlapping fields of view. Contemporary approaches apply convolutional neural networks to compute a 4D correlation volume to estimate the relative rotation between image pairs. In this work, we propose a cross-attention-based approach that utilizes CNN feature maps and a Transformer-Encoder to compute the cross-attention between the activation maps of the image pairs, which is shown to be an improved equivalent of the 4D correlation volume used in previous works. In the suggested approach, higher attention scores are associated with image regions that encode visual cues of rotation. Our approach is end-to-end trainable and optimizes a simple regression loss. It is shown experimentally to outperform contemporary state-of-the-art schemes on commonly used image rotation datasets and benchmarks, establishing a new state-of-the-art accuracy on them. We make our code publicly available.
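The relation between cross-attention and the 4D correlation volume can be illustrated with a small sketch. It assumes single-head attention with identity query/key/value projections and feature maps flattened to lists of per-location vectors; the helper names are hypothetical, and this is not the paper's architecture. Under these assumptions, the unnormalized attention scores are exactly the all-pairs dot products that make up a correlation volume, while the softmax and value aggregation are what attention adds on top.

```python
import math

def correlation_volume(feat_a, feat_b):
    """All-pairs dot products between two flattened feature maps.

    feat_a: list of N_a vectors of length C (one per spatial location).
    feat_b: list of N_b vectors of length C.
    Reshaping the (N_a, N_b) result to (H, W, H, W) gives the 4D volume.
    """
    return [[sum(x * y for x, y in zip(a, b)) for b in feat_b] for a in feat_a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(feat_a, feat_b):
    """Scaled dot-product attention: queries from image A, keys/values from B.

    Identity projections are an assumption made to expose the link to the
    correlation volume; a trained model would use learned projections.
    """
    c = len(feat_a[0])
    scores = correlation_volume(feat_a, feat_b)
    weights = [softmax([s / math.sqrt(c) for s in row]) for row in scores]
    attended = [
        [sum(w * b[k] for w, b in zip(w_row, feat_b)) for k in range(c)]
        for w_row in weights
    ]
    return weights, attended

# Toy features: 2 locations in image A, 3 in image B, 2 channels each.
feat_a = [[1.0, 0.0], [0.0, 1.0]]
feat_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
corr = correlation_volume(feat_a, feat_b)
weights, attended = cross_attention(feat_a, feat_b)
```

Each row of `weights` is a normalized distribution over locations in the second image, so locations with larger correlation receive larger attention.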
