Self-Supervised Place Recognition by Refining Temporal and Featural Pseudo Labels from Panoramic Data

Chao Chen; Zegang Cheng; Xinhao Liu; Yiming Li; Li Ding; Ruoyu Wang; Chen Feng

TL;DR

This work proposes TF-VPR, a self-supervised framework for visual place recognition with deep networks. It combines temporal neighborhoods with learnable feature neighborhoods to discover unknown spatial neighborhoods, and it is trained iteratively.

Abstract

Visual place recognition (VPR) using deep networks has achieved state-of-the-art performance. However, most such methods require a training set with ground truth sensor poses to obtain positive and negative samples of each observation's spatial neighborhood for supervised learning. When such information is unavailable, temporal neighborhoods from a sequentially collected data stream can be exploited for self-supervised training, although we find the resulting performance suboptimal. Inspired by noisy label learning, we propose a novel self-supervised framework named TF-VPR that uses temporal neighborhoods and learnable feature neighborhoods to discover unknown spatial neighborhoods. Our method follows an iterative training paradigm that alternates between: (1) representation learning with data augmentation, (2) positive set expansion to include the current feature space neighbors, and (3) positive set contraction via geometric verification. We conduct auto-labeling and generalization tests on both simulated and real datasets, with either RGB images or point clouds as inputs. The results show that our method outperforms self-supervised baselines in recall rate, robustness, and heading diversity, a novel metric we propose for VPR. Our code and datasets can be found at https://ai4ce.github.io/TF-VPR/
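The expansion and contraction steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`temporal_positives`, `expand`, `contract`, `refine_labels`), the plain k-nearest-neighbor search, and the `verify` callable standing in for geometric verification are all assumptions made for illustration, and the representation-learning step is omitted (the features are taken as given).

```python
import numpy as np


def temporal_positives(n, window):
    """Initial pseudo labels: frames within `window` steps of each query."""
    return [
        {j for j in range(max(0, i - window), min(n, i + window + 1)) if j != i}
        for i in range(n)
    ]


def expand(features, positives, k):
    """Expansion: propose the k nearest feature-space neighbors of each
    query that are not already labeled positive (hypothetical kNN rule)."""
    feats = np.asarray(features, dtype=float)
    dist = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a query is never its own neighbor
    candidates = []
    for i in range(len(feats)):
        nearest = np.argsort(dist[i])[:k]
        candidates.append({int(j) for j in nearest} - positives[i])
    return candidates


def contract(candidates, verify):
    """Contraction: keep only candidates accepted by `verify(i, j)`,
    a stand-in for the geometric verification step."""
    return [{j for j in cand if verify(i, j)} for i, cand in enumerate(candidates)]


def refine_labels(features, positives, k, verify):
    """One expansion + contraction pass over the positive sets."""
    verified = contract(expand(features, positives, k), verify)
    return [p | v for p, v in zip(positives, verified)]


# Toy trajectory that walks out and back along a line (1-D "places").
positions = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
features = positions[:, None]  # pretend the learned feature is the position
labels = temporal_positives(len(positions), window=1)
refined = refine_labels(
    features, labels, k=2,
    verify=lambda i, j: abs(positions[i] - positions[j]) < 0.5,  # toy check
)
```

In this toy run, frame 0 starts with only its temporal neighbor (frame 1) as a positive and gains frame 6, the revisit of the same place, after one pass. In the full method this pass would alternate with retraining the feature extractor.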

Paper Structure (17 sections, 8 equations, 12 figures, 7 tables)

Introduction
Related Work
Method
Problem setup and formulation
Initial data labeling and data augmentation
Expansion
Contraction
Loss function
Experiments
Evaluation metrics
Experiments on KITTI-360 dataset
Experiments on Habitat-Sim dataset
Conclusion
Computation Time
Autolabelling visualization
...and 2 more sections

Figures (12)

Figure 1: We introduce the first iterative approach to explore a query's spatial neighbors given its temporal neighbors. Our solution is based on the interconnections between the temporal, spatial, and feature neighborhoods in sensory data: a query’s spatial neighborhood expands from its temporal to its feature neighbors, then contracts to exclude wrong neighbors; this is iterated during training until the neighborhoods converge.
Figure 2: Overview of TF-VPR. A novel iterative method is designed for mining all of a query's spatial neighbors. Labeling, training, expansion, and contraction are the four major steps in our approach. Labels can be refurbished by iteratively learning feature representations, adding extra verified feature positives, and eliminating false positives to achieve self-supervised VPR.
Figure 3: Need for the expansion step. On the left, training the network with only temporal neighbors limits evaluated positives to those with the same orientation as the query, missing spatial neighbors with different headings. On the right, combining data augmentation with iterative feature neighborhood expansion discovers more spatial neighbors.
Figure 4: Heading diversity illustration. The angle represents the heading difference between the query and the evaluated positives. HD is the number of angular bins covered by true positives divided by the number covered by the ground truth. The figure gives an example of how to calculate HD. Excluding the first and last bins, $\tilde{\mathcal{P}}_{\mathbf{q}_i}$ contains 10 retrieved non-temporal positives, 8 of which are true positives. These fall into 5 different bins, while the ground truth covers 6 bins, so HD is $5/6$.
Figure 5: Heading Diversity, Recall@1, Recall@5, and Recall@10 versus training epochs on the scene Drive_0000/ 00.02 from the KITTI-360 dataset (left) and the scene Reyno/R.G from the Habitat-Sim dataset (right). Abbreviations: T denotes SPTM [savinov2018semi]. S denotes the supervised method [uy2018pointnetvlad]. A and F follow the notation in the Experiments section.
...and 7 more figures
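The heading diversity (HD) calculation described in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name `heading_diversity`, the choice of 8 equal-width bins over 360 degrees, and the convention of returning 0 when the ground truth covers no counted bin are assumptions.

```python
def heading_diversity(query_heading_deg, true_positive_headings_deg,
                      ground_truth_headings_deg, n_bins=8):
    """Ratio of angular bins covered by true positives to bins covered by
    the ground truth, ignoring the first and last bins (headings close to
    the query's own). `n_bins=8` is an assumed value."""
    width = 360.0 / n_bins

    def covered_bins(headings):
        bins = set()
        for h in headings:
            diff = (h - query_heading_deg) % 360.0   # heading difference in [0, 360)
            b = min(int(diff // width), n_bins - 1)  # guard against float edge cases
            if 0 < b < n_bins - 1:                   # skip first and last bins
                bins.add(b)
        return bins

    gt_bins = covered_bins(ground_truth_headings_deg)
    if not gt_bins:
        return 0.0  # assumed convention when there is nothing to cover
    tp_bins = covered_bins(true_positive_headings_deg) & gt_bins
    return len(tp_bins) / len(gt_bins)


# Mirrors the caption's example: 8 true positives fall into 5 bins,
# while the ground truth covers 6 bins.
ground_truth = [50, 100, 140, 190, 230, 280]
true_positives = [50, 55, 100, 105, 140, 190, 195, 230]
hd = heading_diversity(0.0, true_positives, ground_truth)
```

With these made-up headings, the true positives cover bins 1 through 5 and the ground truth covers bins 1 through 6, so `hd` equals 5/6, matching the ratio in the caption.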
