PoCo: Point Context Cluster for RGBD Indoor Place Recognition

Jing Liang; Zhuo Deng; Zheming Zhou; Omid Ghasemalizadeh; Dinesh Manocha; Min Sun; Cheng-Hao Kuo; Arnie Sen

TL;DR

This work tackles indoor RGB-D place recognition by learning global descriptors from color and geometric cues in 3D point clouds. It introduces PoCo, an end-to-end architecture that generalizes Context of Clusters (CoCs) to 3D point clouds, enabling joint color-geometry feature extraction and dynamic clustering. Trained with Circle Loss and Triplet Loss, PoCo achieves state-of-the-art Recall@1 on ScanNet-PR ($R@1=64.63\%$) and ARKit ($R@1=45.12\%$) while running $1.75\times$ faster at inference than the previous best method, CGis. It is also shown to recognize places in a real-world laboratory environment. These results suggest that fusing color and geometry at the point level improves both the accuracy and the efficiency of indoor place recognition, which is relevant to robot navigation and SLAM systems.

Abstract

We present a novel end-to-end algorithm (PoCo) for the indoor RGB-D place recognition task, which aims to identify the most likely match for a given query frame within a reference database. The task is inherently challenging because perception sensors have a constrained field of view and limited range. We propose a new network architecture that generalizes the recent Context of Clusters (CoCs) to extract global descriptors directly from noisy point clouds through end-to-end learning. Moreover, we integrate both color and geometric modalities into the point features to enhance the global descriptor representation. We evaluate on the public datasets ScanNet-PR and ARKit, with 807 and 5047 scenarios, respectively. PoCo achieves state-of-the-art performance: on ScanNet-PR, it reaches an R@1 of 64.63%, a 5.7% relative improvement over the best published result, CGis (61.12%); on ARKit, it reaches an R@1 of 45.12%, a 13.3% relative improvement over CGis (39.82%). In addition, PoCo is 1.75× faster than CGis at inference, and we demonstrate its effectiveness in recognizing places within a real-world laboratory environment.
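The retrieval step and the R@1 metric mentioned in the abstract can be illustrated with a small sketch. It assumes that frames are compared by cosine similarity between their global descriptors and that each query comes with a set of ground-truth database indices; the similarity measure, the function names, and the toy descriptors are all illustrative assumptions, not details taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a descriptor to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_database(query, database):
    """Return database indices sorted from most to least similar to the query."""
    q = l2_normalize(query)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(d))) for d in database]
    return sorted(range(len(database)), key=lambda i: sims[i], reverse=True)

def recall_at_k(queries, database, positives, k=1):
    """Fraction of queries with at least one true match among the top-k results."""
    hits = 0
    for query, pos in zip(queries, positives):
        top_k = rank_database(query, database)[:k]
        hits += any(i in pos for i in top_k)
    return hits / len(queries)

# Toy 3-D descriptors, made up for illustration (not PoCo outputs).
database = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9], [0.6, 0.7, 0.0]]
positives = [{0}, {2}, {0}]  # hypothetical ground-truth matches per query

ranking = rank_database(queries[0], database)
r1 = recall_at_k(queries, database, positives, k=1)
r2 = recall_at_k(queries, database, positives, k=2)
print(ranking, r1, r2)
```

In this toy setup the third query retrieves the wrong frame first, so it counts as a miss at k=1 but as a hit at k=2.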

Paper Structure (11 sections, 7 equations, 8 figures, 3 tables)

Introduction
Related Works
Our Approach
Problem Formulation
Architecture of Our PoCo Model
Training
Experiments
Dataset and Training Settings
Implementations of Comparison and Ablation Study
Results
Conclusion, Limitations and Future Work

Figures (8)

Figure 1: The database contains several frames captured by RGB-D cameras in different rooms. The robot moves along the green trajectory and performs place recognition in real time to localize itself. When the robot reaches the red circle, it observes the blue frame and successfully finds the best-matching frame in the database.
Figure 2: PoCo Architecture. The input contains query and database frames. The model generates a global descriptor for each frame, and the descriptors are used to rank database frames by their similarity to the query frame. Note that the extractor consists of Farthest Point Sampling (FPS), an encoder, and layers of Reducer and Cluster blocks (see Figs. 3 and 4).
Figure 3: Reducer Block. The blue circles are points at different levels. In the Reducer Block, we use Farthest Point Sampling to downsample the points $\mathbf{P}^l$ to $\mathbf{P}^{l+1}$ and aggregate the geometric and feature information of the K nearest neighbors into the downsampled $\mathbf{P}^{l+1}$ using the paper's aggregation equation (labeled eq:pointconvformer).
Figure 4: Cluster Block. Red points are centers downsampled from the blue points. The blue points are grouped by their similarity to the red centers. The features of the blue points are then aggregated to the centers using the paper's cluster equation (labeled eq:cluster), and the center features are dispatched back to the points using its dispatch equation (labeled eq:dispatch).
Figure 5: Qualitative comparison on the ScanNet-PR dataset. PoCo outperforms other approaches in challenging scenarios where the query and positive frames share only a small overlap area (marked by red circles).
...and 3 more figures
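
The captions for Figures 2 and 3 mention Farthest Point Sampling (FPS) as the downsampling step. The following is a minimal pure-Python sketch of the standard greedy FPS procedure, not the authors' implementation; the function name, the fixed starting index, and the toy point cloud are assumptions made for illustration.

```python
def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily pick indices so each new point is farthest from those already chosen."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be in 1..len(points)")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    selected = [start_index]
    # Squared distance from every point to its nearest selected point so far.
    min_dist = [sq_dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        # The next sample is the point farthest from the current selection.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], sq_dist(p, points[nxt]))
    return selected

# Toy cloud (hypothetical): two tight pairs of points plus two isolated points.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0),
         (0.0, 2.0, 0.0), (5.0, 5.0, 0.0), (5.1, 5.0, 0.0)]
sampled = farthest_point_sampling(cloud, 3)
print(sampled)
```

Because each pick maximizes the distance to the points already chosen, the samples spread across the cloud instead of concentrating in dense regions, which is why FPS is a common choice for downsampling point clouds. Practical implementations are usually vectorized or run on the GPU; this version favors readability.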