OpenGraph: Open-Vocabulary Hierarchical 3D Graph Representation in Large-Scale Outdoor Environments

Yinan Deng, Jiahui Wang, Jingyu Zhao, Xinyu Tian, Guangyan Chen, Yi Yang, Yufeng Yue

TL;DR

OpenGraph addresses the challenge of outdoor-scale open-vocabulary mapping by introducing a three-module pipeline that uses caption-enhanced perception, object-centric 3D mapping, and a hierarchical lane-aware graph. By projecting 2D caption features into 3D space and encoding captions with LLMs, it achieves robust zero-shot semantic understanding and open-vocabulary retrieval in large outdoor environments. The hierarchical graph enables efficient maintenance and structured queries, including interactive map updates and path planning on the lane graph. Empirical results on SemanticKITTI show competitive segmentation performance and superior open-vocabulary retrieval compared to baselines, underscoring the practical impact for robotics and outdoor scene understanding.

Abstract

Environment representations endowed with sophisticated semantics are pivotal for facilitating seamless interaction between robots and humans, enabling robots to carry out various tasks effectively. Open-vocabulary maps, powered by Visual-Language models (VLMs), possess inherent advantages, including zero-shot learning and support for open-set classes. However, existing open-vocabulary maps are primarily designed for small-scale environments, such as desktops or rooms, and are typically geared towards limited-area tasks involving robotic indoor navigation or in-place manipulation. They do not generalize directly to outdoor environments characterized by numerous objects and complex tasks, owing to limitations in both their level of understanding and their map structure. In this work, we propose OpenGraph, the first open-vocabulary hierarchical graph representation designed for large-scale outdoor environments. OpenGraph first extracts instances and their captions from visual images, enhancing textual reasoning by encoding them. Subsequently, it achieves 3D incremental object-centric mapping with feature embedding by projecting images onto LiDAR point clouds. Finally, the environment is segmented based on lane graph connectivity to construct a hierarchical graph. Validation results on the public SemanticKITTI dataset demonstrate that OpenGraph achieves the highest segmentation and query accuracy. The source code of OpenGraph is publicly available at https://github.com/BIT-DYN/OpenGraph.
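
The abstract's step of relating images to LiDAR point clouds can be illustrated with a minimal sketch. It assumes a pinhole camera with known intrinsics and a known rigid LiDAR-to-camera transform, with the camera's z-axis pointing forward. The function name `project_points` and all parameter names are hypothetical and are not taken from the OpenGraph code; this shows only the generic geometry, not the authors' implementation.

```python
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project_points(
    points_lidar: Sequence[Point3D],
    rotation: Sequence[Sequence[float]],  # 3x3, LiDAR -> camera (assumed known)
    translation: Sequence[float],         # 3-vector, LiDAR -> camera (assumed known)
    fx: float,
    fy: float,
    cx: float,
    cy: float,                            # pinhole intrinsics (assumed known)
    width: int,
    height: int,
) -> List[Optional[Pixel]]:
    """Return the pixel each LiDAR point lands on, or None if it is not visible."""
    pixels: List[Optional[Pixel]] = []
    for p in points_lidar:
        # Rigid transform into the camera frame: p_cam = R * p + t.
        x, y, z = (
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        if z <= 0.0:
            # Points behind the camera cannot be observed in the image.
            pixels.append(None)
            continue
        # Perspective division followed by the intrinsic mapping.
        u = fx * x / z + cx
        v = fy * y / z + cy
        inside = 0.0 <= u < width and 0.0 <= v < height
        pixels.append((u, v) if inside else None)
    return pixels
```

With such a correspondence, a per-pixel instance mask or feature from the image can be attached to the LiDAR points that fall inside it; points that return `None` simply receive no image information from that frame.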

Paper Structure (15 sections, 7 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: The framework of OpenGraph consists of three main modules: Caption-Enhanced Object Comprehension, Object-Centric Map Construction, and Hierarchical Graph Representation Formation.
  • Figure 2: We employ three visual language models in sequence for image instance segmentation and caption extraction. In order, they perform recognition, detection, and simultaneous segmentation and description generation of objects within the input image.
  • Figure 3: We extract the lane graph $\mathcal{M}_{lg}$ from historical trajectories $P^{(1:t)}$, whose nodes are derived from the vector pinch angle $\Theta^{(n)}$ (breakpoints) and the local disfluency $\lambda^{(n)}$ (inflection points or intersections).
  • Figure 4: Semantic segmentation results on the SemanticKITTI dataset, utilizing 19 classes, indicate that despite not undergoing fine-tuning, OpenGraph demonstrates higher segmentation accuracy and reduced noise levels.
  • Figure 5: The outcomes from various open-vocabulary text queries (displaying the Top-3 objects). In the visual representation, OpenGraph-LLM highlights the Top-3 objects in red, green, and blue, while the other methods render all objects based on relevance. The text beneath each retrieved object in the figure corresponds to its actual category, where green signifies successful retrieval and red indicates failure.
  • ...and 2 more figures
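
The Top-3 retrieval shown in Figure 5 can be sketched as a ranking of map objects against a text query. The sketch below assumes that each object and the query have already been encoded into vectors of the same dimension and that relevance is measured by cosine similarity. Both are assumptions made for illustration, and the names `cosine_similarity` and `top_k_objects` are hypothetical rather than part of OpenGraph.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def top_k_objects(
    query_embedding: Sequence[float],
    object_embeddings: Dict[str, Sequence[float]],  # object id -> embedding (assumed precomputed)
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k objects most similar to the query, best match first."""
    scored = [
        (obj_id, cosine_similarity(query_embedding, emb))
        for obj_id, emb in object_embeddings.items()
    ]
    # Sort by similarity in descending order and keep the first k entries.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

For example, calling `top_k_objects(query, objects)` with the default `k=3` returns three `(object_id, score)` pairs, which could then be highlighted in the map as in the figure.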