RADA: Robust and Accurate Feature Learning with Domain Adaptation
Jingtai He, Gehao Zhang, Tingting Liu, Songlin Du
TL;DR
RADA tackles the challenge of robust local feature learning under severe domain shifts by integrating domain adaptation supervision with a Transformer-based booster. The backbone learns multi-scale features while aligning high-level representations across domains, and the Wave Position Encoder paired with an Attention-Free Transformer integrates global context into descriptors. The framework is guided by targeted losses for detection, description, and their coupling, enabling end-to-end optimization. Empirical results on HPatches and Aachen Day-Night demonstrate improved matching accuracy and localization performance, highlighting the method's practical impact for tasks like visual localization and SfM under changing conditions.
Abstract
Recent advances in keypoint detection and descriptor extraction have shown impressive performance in local feature learning tasks. However, existing methods generally perform suboptimally under extreme conditions such as significant appearance changes and domain shifts. In this study, we introduce a multi-level feature aggregation network that incorporates two pivotal components to facilitate the learning of robust and accurate features with domain adaptation. First, we employ domain adaptation supervision to align high-level feature distributions across different domains and obtain domain-invariant representations. Second, we propose a Transformer-based booster that enhances descriptor robustness by integrating visual and geometric information through wave position encoding, which helps it handle complex conditions. To ensure the accuracy and robustness of features, we adopt a hierarchical architecture to capture comprehensive information and apply targeted supervision to keypoint detection, descriptor extraction, and their coupled processing. Extensive experiments demonstrate that our method, RADA, achieves excellent results in image matching, camera pose estimation, and visual localization tasks.
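
The TL;DR names an Attention-Free Transformer (AFT) as the core of the descriptor booster. As a rough illustration of how such a layer mixes global context into every descriptor without computing pairwise attention maps, the sketch below implements the AFT-simple variant in plain Python. It rests on several assumptions: the text above does not say which AFT variant RADA uses, the function name `aft_simple` is hypothetical, the learned projections that would produce `q`, `k`, and `v` are omitted, and the Wave Position Encoder is not modeled.

```python
import math
from typing import List

Matrix = List[List[float]]  # T rows (keypoints) x D columns (channels)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def aft_simple(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """AFT-simple: y[t] = sigmoid(q[t]) * sum_s softmax_s(k)[s] * v[s].

    The softmax runs over positions s, independently for each channel.
    Illustrative sketch only, not the RADA implementation.
    """
    T, D = len(k), len(k[0])

    # One global context vector, shared by all positions.
    context = []
    for d in range(D):
        m = max(k[s][d] for s in range(T))  # subtract max for stability
        w = [math.exp(k[s][d] - m) for s in range(T)]
        z = sum(w)
        context.append(sum(w[s] * v[s][d] for s in range(T)) / z)

    # Each position gates the shared context with its own query.
    return [[sigmoid(q[t][d]) * context[d] for d in range(D)]
            for t in range(T)]


if __name__ == "__main__":
    q = [[0.0, 0.0], [0.0, 0.0]]
    k = [[0.0, 0.0], [0.0, 0.0]]   # uniform weights over positions
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(aft_simple(q, k, v))
```

Because the context vector is computed once and reused for every position, the cost grows linearly with the number of keypoints, rather than quadratically as in standard self-attention.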
