Active Visual Localization for Multi-Agent Collaboration: A Data-Driven Approach

Matthew Hanlon; Boyang Sun; Marc Pollefeys; Hermann Blum

TL;DR

The paper tackles cross-device visual localization by enabling a robot to actively select viewpoints that maximize localization accuracy within a pre-existing map built by a different sensing device. It introduces a data-driven viewpoint scoring framework with two lightweight models, an MLP and a Transformer-based Viewpoint Transformer (VPT). Both are trained via a sample-and-evaluate pipeline on SfM landmark features and DINO appearance features, and operate within a map $\mathcal{M}=(\mathcal{M}_{l},\mathcal{M}_{t})$ given a prior pose $\hat{\bm{p}}$. Experiments in simulated HM3D-based indoor scenes and real-world deployments show that the data-driven VPT approach outperforms Fisher-information-based and heuristic baselines, particularly when occlusion filtering is included, and generalizes well to real-world data. The work advances practical cross-agent localization by delivering real-time viewpoint selection (under 1 s for 100 candidates on a high-end GPU) and validating a scalable framework for multi-agent and human-robot collaboration in GPS-denied environments.
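The sample-and-score selection step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the learned MLP/VPT scorer is replaced by a hypothetical stand-in, `visible_fraction_score`, which only counts how many 2D landmarks fall inside an assumed 90-degree horizontal field of view. All function names, the 2D simplification, and the field-of-view value are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def sample_yaws(n: int) -> List[float]:
    """Return n evenly spaced candidate yaw angles in [0, 2*pi)."""
    return [2.0 * math.pi * i / n for i in range(n)]


def visible_fraction_score(
    position: Point,
    yaw: float,
    landmarks: Sequence[Point],
    fov: float = math.radians(90.0),  # assumed field of view
) -> float:
    """Stand-in scorer (NOT the learned model): fraction of landmarks in view."""
    if not landmarks:
        return 0.0
    half_fov = fov / 2.0
    visible = 0
    for lx, ly in landmarks:
        bearing = math.atan2(ly - position[1], lx - position[0])
        # Wrap the angular difference into [-pi, pi).
        diff = (bearing - yaw + math.pi) % (2.0 * math.pi) - math.pi
        if abs(diff) <= half_fov:
            visible += 1
    return visible / len(landmarks)


def select_viewpoint(
    position: Point,
    landmarks: Sequence[Point],
    score_fn: Callable[[Point, float, Sequence[Point]], float] = visible_fraction_score,
    n_candidates: int = 100,
) -> Tuple[float, float]:
    """Score every sampled yaw and return (best_yaw, best_score)."""
    candidates = sample_yaws(n_candidates)
    scores = [score_fn(position, yaw, landmarks) for yaw in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


if __name__ == "__main__":
    landmarks = [(0.0, 5.0), (1.0, 5.0), (-1.0, 5.0)]
    yaw, score = select_viewpoint((0.0, 0.0), landmarks)
    print(f"best yaw: {math.degrees(yaw):.1f} deg, score: {score:.2f}")
```

Passing the scorer as `score_fn` keeps the selection loop independent of how viewpoints are evaluated, so a learned model could in principle be swapped in for the stand-in without changing the sampling or argmax logic.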

Abstract

Rather than having each newly deployed robot create its own map of its surroundings, the growing availability of SLAM-enabled devices provides the option of simply localizing in a map of another robot or device. In cases such as multi-robot or human-robot collaboration, localizing all agents in the same map is even necessary. However, localizing, e.g., a ground robot in the map of a drone or head-mounted MR headset presents unique challenges due to viewpoint changes. This work investigates how active visual localization can be used to overcome these challenges. Specifically, we focus on the problem of selecting the optimal viewpoint at a given location. We compare existing approaches in the literature with additional proposed baselines and propose a novel data-driven approach. The results demonstrate the superior performance of the data-driven approach compared to existing methods, both in controlled simulation experiments and in real-world deployment.

Paper Structure (10 sections, 5 equations, 6 figures, 2 tables)

Introduction
Related Work
Method
Experiments
Data Generation
Model Training
Evaluation Strategy
Real-world Deployment and Test
Results
Conclusion

Figures (6)

Figure 1: Viewpoint selection of three methods that run visual localization at the same location with respect to the built map (landmarks in red). The passive strategy of looking forward and a strategy inspired by Davison2002SimultaneousVision, which maximizes the similarity with the viewing angle towards the landmark during mapping, both result in higher localization error than our data-driven viewpoint transformer (VPT) approach.
Figure 2: Difference in perspective between a head-mounted sensor rig used for mapping (left) and a ground robot (right) deployed for localization.
Figure 3: Overview of the proposed active localization approach. The core of our approach is the learning-based viewpoint evaluation model. This model processes input features derived from an established Structure-from-Motion model alongside a camera viewpoint. It predicts the likelihood of the given viewpoint being effective for visual localization. In practice, when deployed, multiple viewpoints are sampled and assessed at a particular 3D location. The viewpoint that receives the highest predicted score is then chosen as the one for the robot to execute.
Figure 4: Overview of the dataset. The number in brackets following the designation corresponds to the index in the Habitat-Matterport 3D dataset. A small collection of scenes leads to strong generalization capability of our model, thanks to our effective data point sampling method.
Figure 5: The constructed map for real-world evaluation. The landmark point cloud (blue) is aligned with the environment mesh. Evaluated locations are shown as mini robots.
...and 1 more figure
