EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera Relocalization
Zhendong Xiao, Changhao Chen, Shan Yang, Wu Wei
TL;DR
EffLoc tackles efficient single-image 6-DoF camera relocalization with a lightweight Vision Transformer built on a memory-efficient hierarchical layout. It introduces Sequential Group Attention (SGA) and Sequential Group Heads (SGH) to diversify feature processing and enhance representation while reducing computation. On the Oxford RobotCar benchmarks, EffLoc improves both accuracy and efficiency, with large reductions in FLOPs and memory usage compared to prior methods. The combination of overlapping patch embedding, memory-conscious attention, and end-to-end pose regression yields a practical, scalable solution for real-world, large-scale outdoor relocalization.
Abstract
Camera relocalization is pivotal in computer vision, with applications in AR, drones, robotics, and autonomous driving. It estimates the camera's 3D position and orientation (6-DoF) from images. Unlike traditional methods such as SLAM, recent approaches use deep learning for direct end-to-end pose estimation. We propose EffLoc, a novel, efficient Vision Transformer for single-image camera relocalization. EffLoc's hierarchical layout, memory-bound self-attention, and feed-forward layers boost memory efficiency and inter-channel communication. The sequential group attention (SGA) module we introduce enhances computational efficiency by diversifying input features, reducing redundancy, and expanding model capacity. EffLoc excels in both efficiency and accuracy, outperforming prior methods such as AtLoc and MapNet. It performs well in large-scale outdoor car-driving scenarios while remaining simple and end-to-end trainable and eliminating handcrafted loss functions.
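To make the idea of "diversifying input features" concrete, the following NumPy sketch shows one plausible reading of a sequential, grouped attention layer. It is an illustration under stated assumptions, not the EffLoc implementation: the abstract does not specify the layer's internals, so the channel split per head, the carry of each head's output into the next head's input, and every name here (`grouped_sequential_attention`, `weights`, the toy sizes) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_sequential_attention(tokens, weights):
    """Illustrative grouped attention (assumed design, not the paper's code).

    tokens:  (n_tokens, channels) feature matrix.
    weights: one (Wq, Wk, Wv) triple per head, each of shape (d, d),
             where d = channels // n_heads.
    """
    n_heads = len(weights)
    # Each head receives a different channel slice instead of the full feature.
    groups = np.split(tokens, n_heads, axis=1)
    carry = np.zeros_like(groups[0])
    outputs = []
    for group, (wq, wk, wv) in zip(groups, weights):
        # Assumption: the previous head's output is added to this head's input.
        x = group + carry
        q, k, v = x @ wq, x @ wk, x @ wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[1]))
        carry = attn @ v
        outputs.append(carry)
    # Concatenate per-head outputs back to the original channel width.
    return np.concatenate(outputs, axis=1)


# Toy sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
n_tokens, channels, n_heads = 6, 16, 4
d = channels // n_heads
tokens = rng.standard_normal((n_tokens, channels))
weights = [
    tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    for _ in range(n_heads)
]
out = grouped_sequential_attention(tokens, weights)
print(out.shape)  # (6, 16)
```

In this sketch each head projects only `channels // n_heads` channels, so the query, key, and value projections are smaller than in standard multi-head attention, where every head sees the full feature. That is the kind of saving the abstract attributes to reduced redundancy.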
