RoboHop: Segment-based Topological Map Representation for Open-World Visual Navigation
Sourav Garg, Krishan Rana, Mehdi Hosseinzadeh, Lachlan Mares, Niko Sünderhauf, Feras Dayoub, Ian Reid
TL;DR
RoboHop presents a segment-based topological map in which image segments serve as persistent, semantically meaningful nodes, connected by intra-image Delaunay edges and inter-image descriptor-based associations. Graph convolutions aggregate the descriptors of neighboring nodes, which improves segment-level localization, while Dijkstra-based planning over the graph yields segment tracks that serve as sub-goals for object-goal navigation. The framework supports open-vocabulary querying by linking CLIP and language-model embeddings to map segments, enabling relational object navigation without task-specific training. Preliminary zero-shot navigation results in real-world and simulated environments demonstrate the feasibility of segment-level hopping and language-grounded planning, and point to the potential for semantic, open-world navigation without heavy perception or policy learning. Overall, RoboHop unites segmentation, graph-based reasoning, and language grounding in a modular, interpretable navigation pipeline.
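To make the descriptor-enhancement step concrete, the following is a minimal sketch of neighborhood aggregation over a segment graph. It assumes a parameter-free mean over each node and its neighbors followed by L2 normalization; the function name `aggregate`, the `num_layers` parameter, and the toy descriptors are illustrative and not taken from the paper, whose exact graph convolution may differ.

```python
import math


def aggregate(descriptors, edges, num_layers=1):
    """Average each node's descriptor with its neighbors', then L2-normalize.

    descriptors: dict mapping node id -> list of floats (all the same length)
    edges: iterable of undirected (node_a, node_b) pairs
    num_layers: how many rounds of aggregation to apply (assumed parameter)
    """
    # Each node is included in its own neighborhood (a self-loop).
    neighbors = {node: {node} for node in descriptors}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    current = descriptors
    for _ in range(num_layers):
        updated = {}
        for node, group in neighbors.items():
            dim = len(current[node])
            mean = [sum(current[n][i] for n in group) / len(group)
                    for i in range(dim)]
            norm = math.sqrt(sum(v * v for v in mean)) or 1.0
            updated[node] = [v / norm for v in mean]
        current = updated
    return current


# Toy example: "a" and "b" are neighbors, "c" is isolated.
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0]}
enhanced = aggregate(toy, [("a", "b")])
```

After one layer, connected nodes share context from each other, while an isolated node keeps its own (normalized) descriptor.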
Abstract
Mapping is crucial for spatial reasoning, planning and robot navigation. Existing approaches range from metric ones, which require precise geometry-based optimization, to purely topological ones, where image-as-node based graphs lack explicit object-level reasoning and interconnectivity. In this paper, we propose a novel topological representation of an environment based on "image segments", which are semantically meaningful and open-vocabulary queryable, conferring several advantages over previous works based on pixel-level features. Unlike 3D scene graphs, we create a purely topological graph with segments as nodes, where edges are formed by a) associating segment-level descriptors between pairs of consecutive images and b) connecting neighboring segments within an image using their pixel centroids. This unveils a "continuous sense of a place", defined by inter-image persistence of segments along with their intra-image neighbors. It further enables us to represent and update segment-level descriptors through neighborhood aggregation using graph convolution layers, which improves robot localization based on segment-level retrieval. Using real-world data, we show how our proposed map representation can be used to i) generate navigation plans in the form of "hops over segments" and ii) search for target objects using natural language queries describing spatial relations of objects. Furthermore, we quantitatively analyze data association at the segment level, which underpins inter-image connectivity during mapping and segment-level localization when revisiting the same place. Finally, we show preliminary trials on segment-level "hopping" based zero-shot real-world navigation. Project page with supplementary details: oravus.github.io/RoboHop/
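The two edge types and the "hops over segments" plan described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: intra-image edges link each segment to its k nearest centroids as a simple stand-in for a Delaunay triangulation, inter-image edges link each segment to its best cosine match in the next image, and the edge weights (1 within an image, 0 across images) are assumed. The names `build_graph`, `plan_hops`, `k`, and `min_sim` are hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_graph(images, k=2, min_sim=0.8):
    """Build an undirected segment graph.

    images: list of images, each a list of (centroid_xy, descriptor) segments.
    Nodes are (image_index, segment_index) tuples.
    """
    graph = {}

    def link(a, b, weight):
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight

    for i, segs in enumerate(images):
        for s in range(len(segs)):
            graph.setdefault((i, s), {})
        # Intra-image edges: k nearest centroids (stand-in for Delaunay).
        for s, (centroid, _) in enumerate(segs):
            nearest = sorted((math.dist(centroid, segs[t][0]), t)
                             for t in range(len(segs)) if t != s)
            for _, t in nearest[:k]:
                link((i, s), (i, t), 1.0)  # assumed weight
        # Inter-image edges: best descriptor match in the next image.
        if i + 1 < len(images) and images[i + 1]:
            for s, (_, desc) in enumerate(segs):
                sims = [cosine(desc, d) for _, d in images[i + 1]]
                best = max(range(len(sims)), key=sims.__getitem__)
                if sims[best] >= min_sim:
                    link((i, s), (i + 1, best), 0.0)  # assumed weight
    return graph


def plan_hops(graph, start, goal):
    """Dijkstra shortest path; returns a list of nodes or None if unreachable."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nxt, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy example: two consecutive images, each with two segments.
images = [
    [((10, 10), [1.0, 0.0]), ((50, 10), [0.0, 1.0])],
    [((12, 10), [1.0, 0.0]), ((52, 10), [0.0, 1.0])],
]
graph = build_graph(images, k=1)
hops = plan_hops(graph, (0, 0), (1, 1))
```

With zero-cost inter-image edges, the planner prefers following a persistent segment across images and only pays for switching to a neighboring segment within an image, which is one way to read the idea of hopping over segments.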
