Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information

Luca Di Giammarino, Boyang Sun, Giorgio Grisetti, Marc Pollefeys, Hermann Blum, Daniel Barath

TL;DR

This work tackles active localization by learning where to look: it defines a scoring function $f_{\mathcal{P}}(\mathbf{R},\mathbf{t})$ that evaluates the localization quality of candidate viewpoints. A compact, real-time encoder operates on a geometry-driven map built from a voxel-location grid and spherical Fibonacci orientations, with a self-supervised training loop that labels viewpoints using COLMAP-based pose verification against simulator-generated data. The map supports multiple valid viewpoints per location and can be embedded into planning, enabling planners to choose viewpoints that balance path cost and localization accuracy. Experiments show improvements over Fisher-information baselines and real-time inference (~$0.02$ s) on indoor-like scenes; the approach generalizes from synthetic to real data, and an open-source implementation is released for the community. The results underscore the importance of the image-landmark distribution and 3D geometric information for robust active localization in robotics applications.
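To make the role of the scoring function concrete, the sketch below shows one way a score over (location, view direction) pairs could be used to pick a viewpoint and to trade off path cost against localization quality. It is an illustration under stated assumptions, not the paper's implementation: `toy_score` is a hypothetical geometric stand-in for the learned encoder (it only counts landmarks inside a view cone), and `best_direction`, `step_cost`, and the `weight` parameter are names introduced here for illustration.

```python
import math
from typing import Callable, Sequence, Tuple

Vec3 = Tuple[float, float, float]
# Hypothetical scorer signature: (location, view direction) -> score in [0, 1].
ScoreFn = Callable[[Vec3, Vec3], float]


def toy_score(landmarks: Sequence[Vec3], half_fov_deg: float = 30.0) -> ScoreFn:
    """Toy stand-in for a learned scorer: fraction of landmarks inside a view cone."""
    cos_limit = math.cos(math.radians(half_fov_deg))

    def score(location: Vec3, direction: Vec3) -> float:
        norm_d = math.sqrt(sum(c * c for c in direction))
        if not landmarks or norm_d == 0.0:
            return 0.0
        visible = 0
        for lm in landmarks:
            ray = tuple(l - p for l, p in zip(lm, location))
            norm_r = math.sqrt(sum(c * c for c in ray))
            if norm_r == 0.0:
                continue  # landmark coincides with the camera center
            cos_angle = sum(r * d for r, d in zip(ray, direction)) / (norm_r * norm_d)
            if cos_angle >= cos_limit:
                visible += 1
        return visible / len(landmarks)

    return score


def best_direction(location: Vec3, directions: Sequence[Vec3],
                   score: ScoreFn) -> Tuple[Vec3, float]:
    """Return the candidate direction with the highest score at this location."""
    best = max(directions, key=lambda d: score(location, d))
    return best, score(location, best)


def step_cost(a: Vec3, b: Vec3, loc_score: float, weight: float = 1.0) -> float:
    """Assumed planner cost: travel distance plus a penalty for a low score at b."""
    return math.dist(a, b) + weight * (1.0 - loc_score)
```

With a cost of this shape, a larger `weight` makes a planner accept longer paths in exchange for viewpoints that score better; the actual cost used in the paper may differ.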

Abstract

Accurate localization in diverse environments is a fundamental challenge in computer vision and robotics. The task involves determining the precise position and orientation of a sensor, typically a camera, within a given space. Traditional localization methods often rely on passive sensing, which may struggle in scenarios with limited features or dynamic environments. In response, this paper explores the domain of active localization, emphasizing the importance of viewpoint selection to enhance localization accuracy. Our contributions are a data-driven approach with a simple architecture designed for real-time operation, a self-supervised training method, and the capability to consistently integrate our map into a planning framework tailored for real-world robotics applications. Our results demonstrate that our method outperforms the existing method targeting similar problems and generalizes to both synthetic and real data. We also release an open-source implementation to benefit the community.

Paper Structure

This paper contains 16 sections, 4 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Pipeline. Given a model, we aim to learn the camera viewpoint that maximizes the accuracy of visual localization. Our methodology first samples the camera locations and orientations, then calculates the best visibility orientation for each location, and finally learns active viewpoints through an encoder. The illustration above shows our full pipeline predicting active viewpoints for visual localization embedded into a planning framework.
  • Figure 2: Learning active viewpoints. Given a set of camera poses parameterized as homogeneous transformation matrices, obtained as explained in Sec. \ref{sec:sampling}, and visibility information (3D landmarks and their projections), our goal is to develop a scoring function that discerns the suitability of a camera viewpoint for visual localization. To achieve this, we first identify the data visible from each camera view, as elaborated in Sec. \ref{sec:visibility}. Subsequently, we encode this visible data through image binning to obtain a fixed input size. The encoded information is then fed into an encoder, which predicts the quality of the viewpoint for localization. This learning process, detailed in Sec. \ref{sec:nn}, is supervised by consistently providing the camera position, querying an RGB image through a simulator, and directly registering this image against a model.
  • Figure 3: Camera viewpoint generation. We represent our map as a discrete voxel grid $\mathcal{V}$ and a discrete set of orientations $\mathcal{R}$ constructed within the boundaries of a 3D reconstruction, e.g., coming from a method. We filter the best directions from each camera location in the voxel grid based on visibility $\mathcal{Q}$, gradually removing occlusions. The illustration is done in 2D for ease of visualization.
  • Figure 4: Spherical sampling methods. The left plot shows classical azimuth-elevation sampling. The right one shows Fibonacci sampling. We employ the technique on the right, given its more uniformly distributed pattern.
  • Figure 5: Qualitative planning experiments with self-recorded data. In this setup, encounters challenges in predicting certain viewpoints, leading to failures in our planner. The conventional look-forward camera approach does not consider active viewpoints. Its limitations become apparent because it never adjusts the camera toward regions with higher landmark density and may therefore point it toward featureless areas. We specifically use a noisy model to show the robustness and adaptability of our approach, reflecting real-world experiments.
  • ...and 2 more figures
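
The Fibonacci sampling mentioned in the Figure 4 caption can be sketched as follows. This is the standard golden-angle construction of a spherical Fibonacci lattice, offered as a minimal illustration; the function name `fibonacci_sphere` is introduced here, and the exact variant and number of orientations used in the paper are not given in this excerpt.

```python
import math
from typing import List, Tuple


def fibonacci_sphere(n: int) -> List[Tuple[float, float, float]]:
    """Return n approximately evenly spread unit vectors on the sphere."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Golden angle in radians: successive points rotate by this azimuth step.
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        # z is spaced uniformly in (-1, 1), which gives equal-area bands.
        z = 1.0 - (2.0 * i + 1.0) / n
        radius = math.sqrt(1.0 - z * z)  # radius of the horizontal circle at height z
        theta = golden_angle * i
        points.append((radius * math.cos(theta), radius * math.sin(theta), z))
    return points
```

Unlike a regular azimuth-elevation grid, which clusters samples near the poles, this construction spaces heights uniformly and spreads azimuths by an irrational rotation, so each returned vector can serve as a candidate viewing direction with roughly equal angular coverage.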