RoboHop: Segment-based Topological Map Representation for Open-World Visual Navigation

Sourav Garg, Krishan Rana, Mehdi Hosseinzadeh, Lachlan Mares, Niko Sünderhauf, Feras Dayoub, Ian Reid

TL;DR

RoboHop presents a segment-based topological map where image segments serve as persistent, semantically meaningful nodes, connected via intra-image Delaunay edges and inter-image descriptor-based associations. Node descriptors are enhanced through graph convolutions, producing robust localization, while Dijkstra-based planning propagates segment tracks across the graph to form object-goal navigation sub-goals. The framework supports open-vocabulary querying by linking CLIP and language-model embeddings to map segments, enabling relational object navigation without task-specific training. Preliminary zero-shot navigation results in real-world and simulated environments demonstrate the feasibility of segment-level hopping and language-grounded planning, highlighting the potential for semantic, open-world navigation without heavy perception or policy learning. Overall, RoboHop advances open-world navigation by uniting segmentation, graph-based reasoning, and language grounding into a modular, interpretable navigation pipeline.

Abstract

Mapping is crucial for spatial reasoning, planning and robot navigation. Existing approaches range from metric maps, which require precise geometry-based optimization, to purely topological maps, where image-as-node based graphs lack explicit object-level reasoning and interconnectivity. In this paper, we propose a novel topological representation of an environment based on "image segments", which are semantically meaningful and open-vocabulary queryable, conferring several advantages over previous works based on pixel-level features. Unlike 3D scene graphs, we create a purely topological graph with segments as nodes, where edges are formed by a) associating segment-level descriptors between pairs of consecutive images and b) connecting neighboring segments within an image using their pixel centroids. This unveils a "continuous sense of a place", defined by inter-image persistence of segments along with their intra-image neighbors. It further enables us to represent and update segment-level descriptors through neighborhood aggregation using graph convolution layers, which improves robot localization based on segment-level retrieval. Using real-world data, we show how our proposed map representation can be used to i) generate navigation plans in the form of "hops over segments" and ii) search for target objects using natural language queries describing spatial relations of objects. Furthermore, we quantitatively analyze data association at the segment level, which underpins inter-image connectivity during mapping and segment-level localization when revisiting the same place. Finally, we show preliminary trials on zero-shot real-world navigation based on segment-level "hopping". Project page with supplementary details: oravus.github.io/RoboHop/
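
The two edge types and the neighborhood aggregation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the function names (`build_graph`, `aggregate`), the toy 2-D descriptors, the similarity threshold, and the use of k-nearest centroids as a plain stand-in for the Delaunay triangulation mentioned in the TL;DR. The aggregation step is a weight-free mean over each node and its neighbors, which is one simple reading of "graph convolution" rather than a confirmed detail of the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_graph(images, k=2, sim_threshold=0.9):
    """Build a segment graph from a sequence of images.

    images: list of images, each a list of (centroid_xy, descriptor) segments.
    Nodes are keyed by (image_index, segment_index).
    Returns (descriptors, adjacency).
    """
    desc, adj = {}, {}
    for i, segs in enumerate(images):
        for s, (_, d) in enumerate(segs):
            desc[(i, s)] = list(d)
            adj[(i, s)] = set()

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for i, segs in enumerate(images):
        # Intra-image edges: connect each segment to its k nearest centroids
        # (assumed stand-in for a Delaunay triangulation of the centroids).
        for s, (c, _) in enumerate(segs):
            others = sorted((t for t in range(len(segs)) if t != s),
                            key=lambda t: math.dist(c, segs[t][0]))
            for t in others[:k]:
                link((i, s), (i, t))
        # Inter-image edges: link each segment to its most similar segment
        # in the next image if the similarity clears the threshold.
        if i + 1 < len(images):
            nxt = images[i + 1]
            for s, (_, d) in enumerate(segs):
                best = max(range(len(nxt)),
                           key=lambda t: cosine(d, nxt[t][1]), default=None)
                if best is not None and cosine(d, nxt[best][1]) >= sim_threshold:
                    link((i, s), (i + 1, best))
    return desc, adj

def aggregate(desc, adj, layers=1):
    """Average each descriptor with its neighbors, then L2-normalize."""
    for _ in range(layers):
        new = {}
        for u, d in desc.items():
            group = [d] + [desc[v] for v in adj[u]]
            mean = [sum(col) / len(group) for col in zip(*group)]
            norm = math.sqrt(sum(x * x for x in mean)) or 1.0
            new[u] = [x / norm for x in mean]
        desc = new
    return desc

# Toy sequence: two consecutive images with three segments each.
images = [
    [((0, 0), [1.0, 0.0]), ((10, 0), [0.0, 1.0]), ((0, 12), [1.0, 1.0])],
    [((1, 0), [1.0, 0.05]), ((11, 0), [0.05, 1.0]), ((1, 12), [1.0, 1.0])],
]
desc, adj = build_graph(images)
refined = aggregate(desc, adj, layers=2)
```

In this toy sequence each segment in the first image is linked to its counterpart in the second, so a segment's refined descriptor mixes in both its intra-image neighbors and its track across images. Localization would then retrieve the map node whose refined descriptor is most similar to a query segment's.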

Paper Structure (22 sections, 2 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: We present a topological, segment-based map representation which can generate navigation plans from open-vocabulary queries in the form of 'hops' over segments to reach the goal, without needing a learned policy.
  • Figure 2: Illustration of our overall pipeline from image segments to mapping, language querying, and planning.
  • Figure 3: Target Object Search Based on Relational Natural Language Queries: The LLM parses a relational query into a reference and target node textual description suitable for CLIP to process into language feature vectors. We then retrieve top-3 candidate target and reference nodes from the map by respectively matching the CLIP language feature vector with the CLIP vision feature vector of each node. Within the topological graph of our map, Dijkstra's algorithm finally selects the object goal for navigation based on the shortest path between the candidate target and reference nodes.
  • Figure 4: Object Instance Recognition in GibsonEnv (Xia et al., 2018): The rows show segment masks (in green) for the query, DINO match, and CLIP match respectively. Symbols (✓/×) adjacent to images indicate success or failure in association. The final column illustrates category-level recognition success despite both methods failing at the instance level (multiple chairs in close proximity).
  • Figure 5: Node-level localization across varying number of graph convolutional layers (y-axis) and incremental inclusion of inter-image edges based on a similarity threshold (x-axis) for DINO (left) and DINOv2 (right).
  • ...and 3 more figures
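
The goal-selection step in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the paper's code: the candidate lists are assumed to have already been retrieved (the LLM parsing and CLIP matching are omitted), and the node names, edge weights, and the helper names `add_edge`, `dijkstra`, and `select_goal` are hypothetical. The sketch picks the candidate target whose shortest-path distance to any candidate reference node is smallest.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source over a weighted adjacency dict."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def select_goal(graph, target_candidates, reference_candidates):
    """Return (distance, target, reference) for the closest candidate pair,
    or None if no target can reach any reference."""
    best = None
    for t in target_candidates:
        dist = dijkstra(graph, t)
        for r in reference_candidates:
            d = dist.get(r)
            if d is not None and (best is None or d < best[0]):
                best = (d, t, r)
    return best

# Hypothetical map: two chairs, only one of which is close to the door.
graph = {}

def add_edge(u, v, w=1.0):
    """Add an undirected weighted edge."""
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w

add_edge("chair_a", "table", 1.0)
add_edge("table", "door", 3.0)
add_edge("door", "chair_b", 1.0)

# Query such as "the chair near the door": targets are chairs, reference is the door.
goal = select_goal(graph, ["chair_a", "chair_b"], ["door"])
print(goal)  # (1.0, 'chair_b', 'door')
```

Running Dijkstra once per candidate target is cheap when only a handful of candidates (top-3 in the caption) are considered, and the same distances can be reused to extract the hop sequence toward the selected goal.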