NYC-Indoor-VPR: A Long-Term Indoor Visual Place Recognition Dataset with Semi-Automatic Annotation

Diwei Sheng, Anbang Yang, John-Ross Rizzo, Chen Feng

TL;DR

This work tackles long-term indoor visual place recognition by presenting NYC-Indoor-VPR, a year-long dataset with over 36k images across 13 crowded indoor scenes in New York City. It introduces a semi-automatic ground-truth annotation pipeline that derives topometric frame locations from paired video trajectories, enabling precise VPR benchmarking without full 3D reconstructions. Benchmark experiments across six state-of-the-art VPR methods reveal substantial challenges posed by indoor dynamics, perceptual aliasing, and occlusions, highlighting the dataset's value for driving advances in indoor VPR. The dataset and annotation tools are publicly available to support future research and method development in indoor localization and navigation.

Abstract

Visual Place Recognition (VPR) in indoor environments helps both humans and robots localize and navigate. It is challenging because of appearance changes at various frequencies and the difficulty of obtaining ground-truth metric trajectories for training and evaluation. This paper introduces the NYC-Indoor-VPR dataset, a unique and rich collection of over 36,000 images compiled from 13 distinct crowded scenes in New York City, taken under varying lighting conditions with appearance changes. Each scene has multiple revisits across a year. To establish the ground truth for VPR, we propose a semi-automatic annotation approach that computes the positional information of each image. Our method takes pairs of videos as input and yields matched pairs of images along with their estimated relative locations. The accuracy of this matching is refined by human annotators, who use our annotation software to correlate the selected keyframes. Finally, we present a benchmark evaluation of several state-of-the-art VPR algorithms using our annotated dataset, revealing how challenging the dataset is and thus its value for VPR research.
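
To make the idea of matching a video pair concrete, the following is a minimal sketch and not the authors' implementation. It assumes each video has already been reduced to 2D keyframe positions, and that the manually matched turning points are given as keyframe indices. The interpolation rule (fraction of arc length travelled between consecutive turning points) and all function names are assumptions made for illustration.

```python
from bisect import bisect_right
from math import hypot


def cumulative_length(points):
    """Arc length travelled at each keyframe of a 2D trajectory."""
    lengths = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        lengths.append(lengths[-1] + hypot(x1 - x0, y1 - y0))
    return lengths


def topometric_coords(points, anchors):
    """Map keyframe index -> segment index + fraction travelled in that segment.

    `anchors` holds the sorted keyframe indices of the matched turning points
    (hypothetical input); at least two are required, and the first and last
    bound the shared route.
    """
    s = cumulative_length(points)
    coords = {}
    for i in range(anchors[0], anchors[-1] + 1):
        # Segment whose start anchor is the last one at or before frame i.
        seg = min(bisect_right(anchors, i) - 1, len(anchors) - 2)
        a, b = anchors[seg], anchors[seg + 1]
        span = s[b] - s[a]
        frac = (s[i] - s[a]) / span if span > 0 else 0.0
        coords[i] = seg + frac
    return coords


def match_frames(points_a, anchors_a, points_b, anchors_b):
    """Pair each keyframe of video A with the nearest keyframe of video B.

    Returns (index_a, index_b, coordinate_gap) tuples. Both anchor lists
    must have the same length, one entry per matched turning point.
    """
    ca = topometric_coords(points_a, anchors_a)
    cb = topometric_coords(points_b, anchors_b)
    pairs = []
    for i, u in ca.items():
        j = min(cb, key=lambda k: abs(cb[k] - u))
        pairs.append((i, j, abs(cb[j] - u)))
    return pairs
```

Because the coordinate is normalized per segment, the two trajectories do not need to share a metric scale, which is one plausible reason to anchor the matching on turning points rather than on raw SLAM positions.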

Paper Structure (8 sections, 7 figures, 3 tables)

This paper contains 8 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Trajectories annotated by our semi-automatic method and example images of 12 scenes in NYC-Indoor-VPR.
  • Figure 2: Comparison of annotation methods for a video (pair) visiting Oculus. COLMAP fails to accurately reconstruct. Visual SLAM can generate a trajectory, but cannot match two trajectories. Our annotation method accurately computes the relative location of each frame in a video pair.
  • Figure 3: Our dataset is collected over a 1-year time span.
  • Figure 4: Overview of our semi-automatic annotation method. We collect two videos of the same route at different times. We use visual SLAM to identify keyframes with topometric locations. We automatically detect turning points (marked in green) and match them manually. We then match the trajectory pairs and generate frame pairs with ground-truth topometric locations.
  • Figure 5: Raw image vs. Anonymized image
  • ...and 2 more figures