EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera Relocalization

Zhendong Xiao, Changhao Chen, Shan Yang, Wu Wei

TL;DR

EffLoc tackles the challenge of efficient single-image 6-DoF camera relocalization by designing a lightweight Vision Transformer with a memory-efficient hierarchical layout. It introduces Sequential Group Attention (SGA) and Sequential Group Heads (SGH) to diversify feature processing and enhance representation while reducing computation. Across Oxford RobotCar benchmarks, EffLoc achieves substantial gains in both accuracy and efficiency, with large reductions in FLOPs and memory usage compared to prior methods. The combination of overlap patch embedding, memory-conscious attention, and end-to-end pose regression yields a practical, scalable solution for real-world, large-scale outdoor relocalization tasks.

Abstract

Camera relocalization is pivotal in computer vision, with applications in AR, drones, robotics, and autonomous driving. It estimates the 3D position and orientation of a camera (its 6-DoF pose) from images. Unlike traditional methods such as SLAM, recent approaches use deep learning for direct end-to-end pose estimation. We propose EffLoc, a novel efficient Vision Transformer for single-image camera relocalization. EffLoc's hierarchical layout, memory-bound self-attention, and feed-forward layers boost memory efficiency and inter-channel communication. Our sequential group attention (SGA) module improves computational efficiency by diversifying input features, reducing redundancy, and expanding model capacity. EffLoc outperforms prior methods such as AtLoc and MapNet in both efficiency and accuracy. It performs well in large-scale outdoor car-driving scenarios while remaining simple and end-to-end trainable, and it eliminates the need for handcrafted loss functions.
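
The abstract describes SGA only at a high level, so the following NumPy sketch is one plausible reading rather than the authors' implementation. It assumes that the channels are split into equal groups, that each head attends over its own group, and that each head's output is added to the next head's input before the outputs are concatenated and projected. The function name, the weight shapes, and the sequential hand-off between heads are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sequential_group_attention(x, head_weights, w_out):
    """Hypothetical sketch of grouped attention with sequentially linked heads.

    x:            (tokens, channels) input features
    head_weights: list of (Wq, Wk, Wv), one triple per head, with shapes
                  (g, d_k), (g, d_k), (g, g) where g = channels // num_heads
    w_out:        (channels, channels) output projection
    """
    num_heads = len(head_weights)
    # Each head receives a different channel group (feature diversification).
    groups = np.split(x, num_heads, axis=1)
    carry = np.zeros_like(groups[0])
    outputs = []
    for group, (wq, wk, wv) in zip(groups, head_weights):
        # Assumption: the previous head's output refines this head's input.
        h = group + carry
        q, k, v = h @ wq, h @ wk, h @ wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        carry = attn @ v
        outputs.append(carry)
    # Concatenate the per-head outputs and mix them with one projection.
    return np.concatenate(outputs, axis=1) @ w_out

rng = np.random.default_rng(0)
tokens, channels, heads, d_k = 16, 32, 4, 8
g = channels // heads
x = rng.standard_normal((tokens, channels))
head_weights = [
    (rng.standard_normal((g, d_k)) * 0.1,
     rng.standard_normal((g, d_k)) * 0.1,
     rng.standard_normal((g, g)) * 0.1)
    for _ in range(heads)
]
w_out = rng.standard_normal((channels, channels)) * 0.1
y = sequential_group_attention(x, head_weights, w_out)
print(y.shape)  # (16, 32)
```

In this sketch each head projects only its own group of channels/heads features instead of the full feature vector, which is where a grouped design of this kind would save computation in the query, key, and value projections.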

Paper Structure (18 sections, 8 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: A comparison between our proposed EffLoc model and other deep-learning-based relocalization models. The x-axis denotes the total number of floating-point operations (FLOPs), while the y-axis represents the total number of parameters. The parameter count is labeled for each corresponding point on the graph. Our EffLoc models are the most efficient, attaining the lowest FLOPs and parameter counts.
  • Figure 2: An overview of EffLoc's hierarchical framework and its modules. The left column shows the overall layout. The middle column highlights the Sequential Group Attention (SGA) module and Sequential Group Heads (SGH). The right column details how SGH integrates outputs across heads. The bottom presents an attention feature map and an overview of the pose regressor that transforms features into a pose.
  • Figure 3: Trajectories on LOOP1 (top), LOOP2 (middle), and FULL1 (bottom) of Oxford RobotCar. The black lines depict the ground truth trajectories, and the red lines represent the trajectory predictions. A yellow star denotes the starting point in each trajectory.
  • Figure 4: Saliency maps of a representative scene from the Oxford RobotCar dataset show that EffLoc guides the attention of the lightweight transformer toward geometrically stable objects (e.g., the distant skyline and tree edges on the right) rather than toward environmental dynamics (e.g., the road in the top figure and moving pedestrians in the bottom), as the comparison with AtLoc shows. This emphasis improves global localization robustness.
  • Figure 5: Convergence speed of EffLoc and AtLoc. The red line (EffLoc) converges toward the optimal solution faster as the number of epochs increases.
  • ...and 1 more figures
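
To make the comparison between predicted and ground-truth trajectories in Figure 3 concrete, the sketch below computes a position error and an orientation error for a single 6-DoF pose. It assumes the orientation is stored as a quaternion in (w, x, y, z) order and uses the Euclidean distance and the quaternion angular distance, which are common choices in relocalization work. Neither the pose representation nor these metrics are stated in this excerpt, and the function names are hypothetical.

```python
import math

def position_error(t_pred, t_true):
    # Euclidean distance between predicted and ground-truth positions,
    # in the same length unit as the inputs.
    return math.dist(t_pred, t_true)

def rotation_error_deg(q_pred, q_true):
    # Angle in degrees between two orientations given as quaternions.
    # Assumption: (w, x, y, z) component order; inputs need not be unit length.
    norm_pred = math.sqrt(sum(c * c for c in q_pred))
    norm_true = math.sqrt(sum(c * c for c in q_true))
    dot = abs(sum(a * b for a, b in zip(q_pred, q_true))) / (norm_pred * norm_true)
    # abs() handles the q / -q ambiguity; min() guards against rounding above 1.
    dot = min(1.0, dot)
    return math.degrees(2.0 * math.acos(dot))

identity = (1.0, 0.0, 0.0, 0.0)
# A 90-degree rotation about the z-axis.
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))

print(position_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))      # 5.0
print(round(rotation_error_deg(quarter_turn_z, identity), 6))  # 90.0
```

Averaging or taking the median of these two errors over all test frames would give per-sequence summary numbers for a trajectory such as LOOP1.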