Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers

Fanis Mathioulakis, Gorjan Radevski, Tinne Tuytelaars

TL;DR

Eff-GRot targets fast, generalizable rotation estimation from RGB images by casting rotation inference as a latent-space comparison between a query and multiple reference views, all processed by a transformer in a single forward pass. The method adds learned rotation embeddings to the reference features and trainable mask vectors to the query features, then predicts a 6D rotation encoding that is projected to a valid SO(3) rotation. Across ShapeNet, LINEMOD, and CO3D, Eff-GRot achieves strong accuracy with a favorable runtime and memory footprint, outperforming fast baselines and approaching or surpassing slower, refinement-heavy methods while remaining end-to-end trainable. Ablations on the number of references, rotation augmentation, and encoder choice complement these results and underscore the method’s efficiency, robustness, and potential for real-time deployment in latency-sensitive settings.
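The projection from a 6D encoding to a valid SO(3) rotation can be illustrated with the commonly used Gram-Schmidt construction. The sketch below is a minimal, dependency-free illustration and not the authors' code: the function name `rotation_from_6d`, the helper names, and the convention that the six numbers are two 3-vectors forming the first two columns of the matrix are all assumptions.

```python
import math


def _normalize(v):
    """Return v scaled to unit length; reject (near-)zero vectors."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-12:
        raise ValueError("degenerate 6D encoding: cannot normalize a zero vector")
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(x):
    """Map six raw numbers to a 3x3 rotation matrix (list of rows).

    Assumed convention (hypothetical, for illustration only):
    x[:3] and x[3:] are unnormalized versions of the first two columns.
    """
    a1, a2 = list(x[:3]), list(x[3:6])
    b1 = _normalize(a1)
    # Remove the component of a2 that lies along b1, then normalize.
    proj = _dot(b1, a2)
    b2 = _normalize([a2[i] - proj * b1[i] for i in range(3)])
    # The third column is fixed by the right-hand rule, so det(R) = +1.
    b3 = _cross(b1, b2)
    # Assemble the matrix with b1, b2, b3 as columns.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]
```

Because the construction orthonormalizes whatever the network outputs, any non-degenerate 6D vector yields a proper rotation; only collinear or zero inputs are rejected.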

Abstract

We introduce Eff-GRot, an approach for efficient and generalizable rotation estimation from RGB images. Given a query image and a set of reference images with known orientations, our method directly predicts the object's rotation in a single forward pass, without requiring object- or category-specific training. At the core of our framework is a transformer that performs a comparison in the latent space, jointly processing rotation-aware representations from multiple references alongside a query. This design enables a favorable balance between accuracy and computational efficiency while remaining simple, scalable, and fully end-to-end. Experimental results show that Eff-GRot offers a promising direction toward more efficient rotation estimation, particularly in latency-sensitive applications.

Paper Structure

This paper contains 27 sections, 9 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Efficient and generalizable rotation estimation. Eff-GRot estimates the rotation of an unseen object by comparing it to a set of RGB reference images with known orientations. The prediction is made in a single forward pass, enabling fast real-time inference.
  • Figure 2: Model overview. Eff-GRot takes as input a set of reference images with known rotations and a query image whose rotation needs to be predicted. The encoder processes images from different viewpoints, mapping them to corresponding representations. Learned rotation embeddings are added to reference image representations, while trainable mask vectors are added to query representations. This complete set of representations is fed into a transformer model that outputs updated representations, which are then mapped to the final rotation prediction through a lightweight MLP head.
  • Figure 3: Comparison with RelPose on the CO3D dataset. While RelPose shows diminishing returns beyond 9 reference views, Eff-GRot continues to improve steadily with additional views.
  • Figure 4: Performance vs runtime tradeoff. We visualize the trade-off between performance and runtime for different methods on LINEMOD. The x-axis is shown on a logarithmic scale to better capture differences in runtime. Eff-GRot exhibits a favorable balance between runtime and performance compared to baselines.
  • Figure 5: Visualization of rotation prediction across different reference distribution scenarios for the cat object. The predicted (purple cross) and ground truth (red square) rotations are visualized for four different reference distributions. Rotations are visualized as points on the viewing sphere, representing camera locations looking toward the object at the center of the sphere. A,B) The references and query vary along a single axis (azimuth and elevation), showing that the model can correctly position the query between the references. C) The query falls within a region covered by the references, enabling successful interpolation for rotation prediction. D) The query lies far outside the reference coverage region, leading to a failure in rotation prediction.
  • ...and 3 more figures
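The viewing-sphere visualization described in Figure 5 can be sketched as follows. This is a hedged illustration rather than the paper's plotting code: the axis convention (z up, azimuth measured in the x-y plane), the unit radius, and the names `camera_position`, `angular_gap_deg`, and `nearest_reference_gap_deg` are assumptions. A point on the sphere captures only the viewing direction, not any in-plane rotation.

```python
import math


def camera_position(azimuth_deg, elevation_deg):
    """Camera location on the unit viewing sphere, looking at the origin.

    Assumed convention: z is up, azimuth is measured in the x-y plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )


def angular_gap_deg(p, q):
    """Angle in degrees between two unit vectors on the sphere."""
    dot = sum(a * b for a, b in zip(p, q))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))


def nearest_reference_gap_deg(query, references):
    """Smallest angular gap between a query viewpoint and any reference."""
    return min(angular_gap_deg(query, r) for r in references)


# Hypothetical reference viewpoints spread along azimuth at zero elevation.
refs = [camera_position(az, 0.0) for az in (0.0, 30.0, 60.0)]
inside = camera_position(45.0, 0.0)    # lies between two references
outside = camera_position(180.0, 0.0)  # far from every reference
gap_inside = nearest_reference_gap_deg(inside, refs)
gap_outside = nearest_reference_gap_deg(outside, refs)
```

A small gap corresponds to the interpolation cases in panels A to C, while a large gap corresponds to the out-of-coverage query in panel D. The gap is only a rough proxy for reference coverage and is not a quantity reported in the source.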