Towards Global Localization using Multi-Modal Object-Instance Re-Identification

Aneesh Chavan, Vaibhav Agrawal, Vineeth Bhat, Sarthak Chittawar, Siddharth Srivastava, Chetan Arora, K Madhava Krishna

TL;DR

This work addresses the gap in robust object-instance ReID for robotics by introducing DATOR, a dual-path RGB-D transformer that fuses color and depth cues to produce discriminative object embeddings. The authors also present a localization framework that builds an object-based memory from RGB-D sequences and localizes unseen views by matching object instances. The framework uses RAM, Grounding DINO, and SAM for detection and segmentation, followed by clustering and robust pose estimation with RANSAC and colored ICP. On real and synthetic indoor datasets, DATOR reaches a mean average precision (mAP) of 75.18 for object ReID, and the localization framework achieves an 83.01% success rate on TUM RGB-D, with improved robustness to varying illumination and clutter. The work contributes publicly available datasets and a complete pipeline for perception and navigation in complex indoor environments.

Abstract

Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D dataset. Our approach demonstrates significant improvements in both object-instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM RGB-D), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available.
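
To make the reported mAP figure concrete, the sketch below shows one common way embedding-based ReID is scored: each query embedding is compared against a gallery of embeddings, the gallery is ranked by similarity, and average precision is averaged over queries. It is a minimal illustration, not the authors' evaluation code. The use of cosine similarity, all function names, and the toy 2-D embeddings are assumptions made for this example; the abstract does not specify the similarity measure or the evaluation protocol.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery indices ordered from most to least similar to the query."""
    sims = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)


def average_precision(ranked_labels, query_label):
    """Average of precision@k taken at every rank k that holds a correct match."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries, query_labels, gallery, gallery_labels):
    """Mean of per-query average precision over all queries."""
    aps = []
    for query, query_label in zip(queries, query_labels):
        order = rank_gallery(query, gallery)
        ranked = [gallery_labels[i] for i in order]
        aps.append(average_precision(ranked, query_label))
    return sum(aps) / len(aps)


# Hypothetical toy data: 2-D vectors standing in for object embeddings.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gallery_labels = ["chair", "chair", "robot", "robot"]
queries = [[1.0, 0.05]]
query_labels = ["chair"]

toy_map = mean_average_precision(queries, query_labels, gallery, gallery_labels)
```

In this toy case both "chair" entries rank ahead of the "robot" entries, so the single query has perfect average precision. A score such as 75.18 would correspond to the same kind of quantity expressed as a percentage, assuming the usual convention.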

Paper Structure

This paper contains 9 sections, 7 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview. We propose a novel dual-path transformer architecture, DATOR, combining cues from both RGB and depth modalities for effective object-instance ReID. Our localization framework generates an instance-based map and uses our ReID model in conjunction with it to localize unseen views.
  • Figure 2: Proposed model DATOR. The model takes paired RGB and depth images of an object and uses cues from both modalities to produce an embedding that can be used for object ReID.
  • Figure 3: Qualitative analysis of DATOR. Given a query in a low-illumination scene, DATOR re-identifies the robot instance successfully, while PADE identifies it as a different robot that is missing an overhead attachment. This gain can be attributed to DATOR's use of depth information.