ROMAN: Open-Set Object Map Alignment for Robust View-Invariant Global Localization
Mason B. Peterson, Yixuan Jia, Yulun Tian, Annika Thomas, Jonathan P. How
TL;DR
ROMAN tackles global localization under drastic viewpoint changes by building open-set object maps and aligning them with a gravity-aware, graph-based data association step that fuses semantic (CLIP-based) and geometric (shape/volume) cues. Submap alignment is posed as a single registration problem whose affinity metrics combine these metric-semantic attributes with a gravity direction prior, so associations remain reliable even when the two maps were built by robots traveling the same route in opposite directions. The authors report improvements over segment-based and image-based baselines across indoor, urban, and off-road scenarios, including up to 45% improvement in relative pose estimation and, when ROMAN serves as the loop closure module in large-scale multi-robot SLAM, a 35% reduction in trajectory estimation error relative to loop closures based on visual features. The object maps are also compact enough to share between robots, which makes the approach practical for drift-free navigation and collaborative SLAM in diverse environments.
Abstract
Global localization is a fundamental capability required for long-term and drift-free robot navigation. However, current methods fail to relocalize when faced with significantly different viewpoints. We present ROMAN (Robust Object Map Alignment Anywhere), a global localization method capable of localizing in challenging and diverse environments by creating and aligning maps of open-set and view-invariant objects. ROMAN formulates and solves a registration problem between object submaps using a unified graph-theoretic global data association approach with a novel incorporation of a gravity direction prior and object shape and semantic similarity. This work's open-set object mapping and information-rich object association algorithm enables global localization, even in instances when maps are created from robots traveling in opposite directions. Through a set of challenging global localization experiments in indoor, urban, and unstructured/forested environments, we demonstrate that ROMAN achieves higher relative pose estimation accuracy than other image-based pose estimation methods or segment-based registration methods. Additionally, we evaluate ROMAN as a loop closure module in large-scale multi-robot SLAM and show a 35% improvement in trajectory estimation error compared to standard SLAM systems using visual features for loop closures. Code and videos can be found at https://acl.mit.edu/roman.
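To make the association step concrete, here is a minimal Python sketch of how a pairwise affinity between candidate object matches could combine the cues named above. It is not the authors' implementation. The function names, the Gaussian kernel and its `sigma`, the volume-ratio shape score, and the geometric-mean fusion are all assumptions chosen for illustration. It also assumes both submaps are expressed in gravity-aligned frames (z up), so that horizontal distances and signed height differences between objects are preserved by the unknown yaw-plus-translation transform.

```python
import numpy as np

def association_score(desc_a, desc_b, vol_a, vol_b):
    """Similarity of one candidate match in [0, 1] (illustrative assumption):
    clipped cosine similarity of semantic descriptors times a volume ratio."""
    cos = float(desc_a @ desc_b / (np.linalg.norm(desc_a) * np.linalg.norm(desc_b)))
    vol_ratio = min(vol_a, vol_b) / max(vol_a, vol_b)
    return max(cos, 0.0) * vol_ratio

def pairwise_consistency(p_i, p_k, q_j, q_l, sigma=0.5):
    """Gravity-aware geometric consistency of matches (i->j) and (k->l).
    Compares horizontal distance and signed vertical offset separately."""
    d_a = p_i - p_k
    d_b = q_j - q_l
    horiz_err = np.linalg.norm(d_a[:2]) - np.linalg.norm(d_b[:2])
    vert_err = d_a[2] - d_b[2]
    return float(np.exp(-0.5 * (horiz_err**2 + vert_err**2) / sigma**2))

def affinity_matrix(cent_a, cent_b, desc_a, desc_b, vol_a, vol_b, assoc, sigma=0.5):
    """Symmetric affinity matrix over candidate matches assoc = [(i, j), ...]."""
    s = np.array([association_score(desc_a[i], desc_b[j], vol_a[i], vol_b[j])
                  for i, j in assoc])
    n = len(assoc)
    M = np.zeros((n, n))
    for a, (i, j) in enumerate(assoc):
        M[a, a] = s[a]
        for b, (k, l) in enumerate(assoc):
            if a == b or i == k or j == l:
                continue  # skip self-pairs and pairs that break one-to-one matching
            geo = pairwise_consistency(cent_a[i], cent_a[k], cent_b[j], cent_b[l], sigma)
            M[a, b] = geo * np.sqrt(s[a] * s[b])  # fuse geometry with per-match scores
    return M

# Toy example: map B is map A seen after a 180-degree yaw plus a translation,
# mimicking two robots traversing the same area in opposite directions.
cent_a = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 5.0, 1.0]])
yaw_180 = np.diag([-1.0, -1.0, 1.0])
cent_b = cent_a @ yaw_180.T + np.array([10.0, 3.0, 0.0])
desc = np.tile(np.array([1.0, 0.0, 0.0]), (3, 1))  # identical descriptors
vols = np.ones(3)
assoc = [(0, 0), (1, 1), (2, 2), (1, 2)]  # last candidate is a wrong match
M = affinity_matrix(cent_a, cent_b, desc, desc, vols, vols, assoc)
```

In this toy example the three correct matches are mutually consistent, while the wrong candidate `(1, 2)` has near-zero affinity with match `(0, 0)`. A full pipeline would then select a mutually consistent set of matches from `M` (for example with a dense-subgraph solver) and estimate the relative pose from the selected object pairs. Both steps are omitted here.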
