Adaptive Agent Selection and Interaction Network for Image-to-point cloud Registration
Zhixin Cheng, Xiaotian Yin, Jiacheng Deng, Bohao Liao, Yujia Chen, Xu Zhou, Baoqun Yin, Tianzhu Zhang
TL;DR
This work tackles image-to-point cloud registration, where cross-modal noise corrupts similarity computation, with an Adaptive Agent Selection and Interaction Network (A2SI). A Phase Map Extractor and an Iterative Agents Selection (IAS) module with Tri-Stage optimization identify reliable cross-modal agents; a Reliable Agents Interaction (RAI) module then performs agent-guided cross-modal fusion in place of standard dense transformer attention. The approach achieves state-of-the-art results on RGB-D Scenes v2 and 7-Scenes, with substantial gains in registration recall and robust performance in challenging scenes. Routing cross-modal aggregation through the selected agents reduces both attention noise and computation, improving the accuracy and robustness of cross-modal registration.
Abstract
Typical detection-free methods for image-to-point cloud registration leverage transformer-based architectures to aggregate cross-modal features and establish correspondences. However, they often struggle under challenging conditions, where noise disrupts similarity computation and leads to incorrect correspondences. Moreover, without dedicated designs, it remains difficult to effectively select informative and correlated representations across modalities, thereby limiting the robustness and accuracy of registration. To address these challenges, we propose a novel cross-modal registration framework composed of two key modules: the Iterative Agents Selection (IAS) module and the Reliable Agents Interaction (RAI) module. IAS enhances structural feature awareness with phase maps and employs reinforcement learning principles to efficiently select reliable agents. RAI then leverages these selected agents to guide cross-modal interactions, effectively reducing mismatches and improving overall robustness. Extensive experiments on the RGB-D Scenes v2 and 7-Scenes benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
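To make the idea of agent-guided interaction concrete, the following is a minimal NumPy sketch of two-step attention through a small set of agents, assuming a generic agent-attention formulation rather than the authors' exact RAI design. The function name `agent_guided_fusion`, the feature sizes, and the top-k-by-norm agent selection are all hypothetical; in particular, the norm-based selection is only a placeholder for IAS, which the abstract describes as based on phase maps and reinforcement learning principles.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def agent_guided_fusion(queries, context, agents):
    """Aggregate `context` into `queries` through a small set of agents.

    queries: (N, d) features of one modality (e.g. image patches)
    context: (M, d) features of the other modality (e.g. points)
    agents:  (k, d) selected agent features, with k much smaller than N and M
    Returns an (N, d) array of fused features.
    """
    scale = 1.0 / np.sqrt(queries.shape[-1])
    # Step 1: each agent attends over the context features -> (k, M) weights.
    agent_to_context = softmax(agents @ context.T * scale, axis=-1)
    agent_values = agent_to_context @ context  # (k, d)
    # Step 2: each query attends over the agents only -> (N, k) weights.
    query_to_agent = softmax(queries @ agents.T * scale, axis=-1)
    return query_to_agent @ agent_values  # (N, d)


rng = np.random.default_rng(0)
img_feats = rng.standard_normal((1200, 64))  # hypothetical image features
pts_feats = rng.standard_normal((800, 64))   # hypothetical point features

# Placeholder agent selection (NOT the IAS module): keep the 16 point
# features with the largest L2 norm.
scores = np.linalg.norm(pts_feats, axis=1)
agents = pts_feats[np.argsort(scores)[-16:]]

fused = agent_guided_fusion(img_feats, pts_feats, agents)
print(fused.shape)
```

In this sketch, dense cross-attention would form an N x M similarity matrix, whereas the agent route forms only a k x M and an N x k matrix. Because every query reads the other modality solely through the k agents, unreliable context features influence the result only to the extent that the agents attend to them, which is one plausible reading of how agent guidance could suppress attention noise.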
