PoCo: Point Context Cluster for RGBD Indoor Place Recognition
Jing Liang, Zhuo Deng, Zheming Zhou, Omid Ghasemalizadeh, Dinesh Manocha, Min Sun, Cheng-Hao Kuo, Arnie Sen
TL;DR
This work tackles indoor RGB-D place recognition by learning global descriptors from color and geometric cues in 3D point clouds. It introduces PoCo, an end-to-end architecture that generalizes Context of Clusters (CoCs) to 3D, enabling joint color-geometry feature extraction and dynamic clustering for robust place recognition. Trained with Circle Loss and Triplet Loss, PoCo achieves state-of-the-art Recall@1 on ScanNet-PR ($R@1=64.63\%$) and ARKit ($R@1=45.12\%$), runs $1.75\times$ faster at inference than the previous best method (CGis), and is further validated in a real-world laboratory environment. These results indicate that fusing color and geometry in a point-cloud context improves both the reliability and the efficiency of indoor localization, with practical relevance for robot navigation and SLAM systems.
Abstract
We present a novel end-to-end algorithm (PoCo) for the indoor RGB-D place recognition task, which aims to identify the most likely match for a given query frame within a reference database. The task is inherently challenging because of the constrained field of view and limited range of perception sensors. We propose a new network architecture that generalizes the recent Context of Clusters (CoCs) to extract global descriptors directly from noisy point clouds through end-to-end learning. Moreover, we integrate both color and geometric modalities into the point features to enhance the global descriptor representation. We evaluate on the public datasets ScanNet-PR and ARKit, with 807 and 5047 scenarios, respectively. PoCo achieves state-of-the-art performance: on ScanNet-PR, we achieve R@1 of 64.63%, a 5.7% relative improvement over the best published result, CGis (61.12%); on ARKit, we achieve R@1 of 45.12%, a 13.3% relative improvement over the best published result, CGis (39.82%). In addition, PoCo is more efficient than CGis at inference time (1.75× faster), and we demonstrate its effectiveness in recognizing places within a real-world laboratory environment.
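To make the reported metric concrete, the following is a minimal sketch of how Recall@K can be computed once every frame has a global descriptor, and of how the quoted gains follow from the R@1 values as relative improvements. It is illustrative only and is not the authors' evaluation code: the cosine similarity, the place-label match criterion, the toy descriptors, and the names `cosine`, `recall_at_k`, and `relative_improvement` are all assumptions introduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two descriptors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_k(queries, database, k=1):
    """Fraction of queries whose top-k retrieved frames include the correct place.

    queries, database: lists of (descriptor, place_id) pairs.
    A retrieval counts as correct when place_id matches, which is a
    simplification of a real benchmark's match criterion.
    """
    hits = 0
    for q_desc, q_place in queries:
        # Rank database frames from most to least similar to the query.
        ranked = sorted(database, key=lambda item: cosine(q_desc, item[0]), reverse=True)
        if any(place == q_place for _, place in ranked[:k]):
            hits += 1
    return hits / len(queries)

def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# Toy 2-D descriptors (hypothetical data, for illustration only).
database = [([1.0, 0.0], "kitchen"), ([0.0, 1.0], "office"), ([1.0, 1.0], "hall")]
queries = [([0.9, 0.1], "kitchen"), ([0.1, 0.9], "office"), ([0.8, 0.6], "kitchen")]

print(recall_at_k(queries, database, k=1))
print(recall_at_k(queries, database, k=2))
print(round(relative_improvement(64.63, 61.12), 1))  # ScanNet-PR R@1 values from the abstract
print(round(relative_improvement(45.12, 39.82), 1))  # ARKit R@1 values from the abstract
```

In the toy data, the third query is closest to the "hall" frame, so it is missed at K=1 and recovered at K=2. The last two lines reproduce the 5.7% and 13.3% figures, which shows that they are relative gains rather than differences in percentage points.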
