Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers
Fanis Mathioulakis, Gorjan Radevski, Tinne Tuytelaars
TL;DR
Eff-GRot tackles fast, generalizable rotation estimation from RGB images by casting rotation inference as a latent-space comparison between a query view and multiple reference views, all processed by a transformer in a single forward pass. The method augments reference features with learned rotation embeddings and uses a trainable mask on the query to predict a 6D rotation encoding, which is then projected to a valid SO(3) rotation. Across ShapeNet, LINEMOD, and CO3D, Eff-GRot achieves strong accuracy with a favorable runtime and memory footprint, outperforming fast baselines and approaching or surpassing slower, refinement-heavy methods while remaining end-to-end trainable. Ablations on the number of references, rotation augmentation, and encoder choice complement these results, underscoring the method’s efficiency, robustness, and potential for real-time deployment in latency-sensitive settings.
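To make the last step concrete, the sketch below shows one common way to turn a 6D encoding into a valid rotation matrix: treat the six numbers as two 3-vectors and orthonormalize them with Gram-Schmidt, taking the third axis as their cross product. The text above does not say which projection Eff-GRot uses, so this is an assumption; the function name `rotation_from_6d` and the column-wise layout are hypothetical choices for illustration only.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v, eps=1e-8):
    norm = math.sqrt(_dot(v, v))
    if norm < eps:
        raise ValueError("cannot normalize a (near-)zero vector")
    return [a / norm for a in v]

def _cross(u, v):
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]

def rotation_from_6d(x):
    """Map a 6-vector to a 3x3 rotation matrix via Gram-Schmidt.

    Assumption: x[:3] and x[3:6] are two (not necessarily orthonormal)
    3-vectors that become the first two columns of the rotation.
    """
    a1, a2 = list(x[:3]), list(x[3:6])
    b1 = _normalize(a1)                       # first axis: unit a1
    proj = _dot(b1, a2)                       # component of a2 along b1
    b2 = _normalize([a - proj * b for a, b in zip(a2, b1)])
    b3 = _cross(b1, b2)                       # third axis completes a right-handed frame
    # Assemble with b1, b2, b3 as columns.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

# Example: an unnormalized, non-orthogonal input that maps to the identity.
R = rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0])
```

Because the output is orthonormal with determinant +1 by construction, a network can regress the six numbers freely and still yield a valid element of SO(3), provided the two input vectors are not collinear.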
Abstract
We introduce Eff-GRot, an approach for efficient and generalizable rotation estimation from RGB images. Given a query image and a set of reference images with known orientations, our method directly predicts the object's rotation in a single forward pass, without requiring object- or category-specific training. At the core of our framework is a transformer that performs a comparison in the latent space, jointly processing rotation-aware representations from multiple references alongside a query. This design enables a favorable balance between accuracy and computational efficiency while remaining simple, scalable, and fully end-to-end. Experimental results show that Eff-GRot offers a promising direction toward more efficient rotation estimation, particularly in latency-sensitive applications.
