Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies

Jared Strader, Nathan Hughes, William Chen, Alberto Speranzon, Luca Carlone

TL;DR

This letter develops methods to build a spatial ontology defining concepts and relations relevant for indoor and outdoor robot operation using a Large Language Model (LLM) and leverages the spatial ontology for 3D scene graph construction using Logic Tensor Networks (LTN) to add logical rules.

Abstract

This paper proposes an approach to build 3D scene graphs in arbitrary indoor and outdoor environments. Such an extension is challenging: the hierarchy of concepts that describes an outdoor environment is more complex than for indoor environments, and manually defining such a hierarchy is time-consuming and does not scale. Furthermore, the lack of training data prevents the straightforward application of learning-based tools used in indoor settings. To address these challenges, we propose two novel extensions. First, we develop methods to build a spatial ontology defining concepts and relations relevant for indoor and outdoor robot operation. In particular, we use a Large Language Model (LLM) to build such an ontology, thus largely reducing the amount of manual effort required. Second, we leverage the spatial ontology for 3D scene graph construction using Logic Tensor Networks (LTN) to add logical rules, or axioms (e.g., "a beach contains sand"), which provide additional supervisory signals at training time, thus reducing the need for labelled data, providing better predictions, and even allowing the prediction of concepts unseen at training time. We test our approach on a variety of datasets, including indoor, rural, and coastal environments, and show that it leads to a significant increase in the quality of 3D scene graph generation with sparsely annotated data.
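To make the idea of an axiom acting as a supervisory signal more concrete, the sketch below scores the example rule "a beach contains sand" with fuzzy logic. It is only an illustration under stated assumptions, not the paper's implementation: the Reichenbach implication, the max-based existential, the p-mean-error universal aggregator, and all names and truth values (`implies`, `exists`, `forall`, `places`) are hypothetical choices made here.

```python
# Illustrative sketch only: operators and numbers are assumptions, not the paper's.

def implies(a, b):
    """Reichenbach fuzzy implication: truth of (a -> b) is 1 - a + a*b."""
    return 1.0 - a + a * b

def exists(truths):
    """Fuzzy 'there exists', approximated here by the maximum truth value."""
    return max(truths) if truths else 0.0

def forall(truths, p=2):
    """Fuzzy 'for all' via a p-mean error: 1 - (mean((1 - t)^p))^(1/p)."""
    if not truths:
        return 1.0
    mean_err = sum((1.0 - t) ** p for t in truths) / len(truths)
    return 1.0 - mean_err ** (1.0 / p)

def axiom_satisfaction(places):
    """Truth of: for every place, Beach(place) -> exists object with Sand(object)."""
    per_place = [implies(pl["beach"], exists(pl["sand"])) for pl in places]
    return forall(per_place)

# Hypothetical classifier outputs: P(place is a beach), and P(object is sand)
# for each object contained in that place.
places = [
    {"beach": 0.9, "sand": [0.8, 0.1]},
    {"beach": 0.1, "sand": [0.05]},
]

sat = axiom_satisfaction(places)
loss = 1.0 - sat  # minimising this pushes predictions towards satisfying the axiom
print(f"satisfaction={sat:.3f}, loss={loss:.3f}")
```

Because the loss depends only on predicted truth values and not on ground-truth labels, a term of this kind can be evaluated on unlabelled nodes, which is one way to read the abstract's claim that axioms reduce the need for labelled data.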

Paper Structure (21 sections, 8 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: 3D scene graph of an indoor-outdoor environment at West Point, NY, constructed using our approach. The hierarchical structure of the graph is based on a spatial ontology generated from Large Language Models, and grounded into a 3D map using Logic Tensor Networks. A satellite image of the area is shown, overlaid with the constructed 3D scene graph.
  • Figure 2: Text scoring approach for generating the spatial ontology. The language model assigns a loss to the text string between low-level and high-level labels. The loss is rescaled using a softmax to assign weights to the edges, and the lowest-weighted edges are then pruned away.
  • Figure 3: Text completion approach for generating the spatial ontology. The language model is asked which low-level labels distinguish the high-level labels. The response is parsed to generate the spatial ontology.
  • Figure 4: Proposed architecture for learning and inference.
  • Figure 5: (a) Results of the predicate ablation for the MP3D dataset. (b) Results of the predicate ablation for the West Point dataset. (c) Results of the predicate ablation for the Castle Island dataset. For all experiments, we run 10 trials of each loss configuration for each percentage of the training data. The mean of the 10 trials is shown as a dashed line, and the shaded area shows the standard deviation.
  • ...and 1 more figures
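
The text scoring pipeline described in the Figure 2 caption (language-model loss, softmax rescaling, pruning of the lowest-weighted edges) can be sketched roughly as follows. This is a minimal illustration under assumptions that the caption does not confirm: the loss values are made up, the softmax is taken over the negated losses so that a lower loss yields a higher weight, normalisation is done per low-level label, and pruning keeps a fixed number of edges. The names `softmax_weights`, `build_ontology`, `loss_table`, and `keep` are hypothetical.

```python
import math

def softmax_weights(losses):
    """Map {high_level_label: loss} to softmax weights over the negated losses.

    Assumption: a lower language-model loss means a more plausible pairing,
    so it should receive a larger edge weight.
    """
    lowest = min(losses.values())  # subtract the minimum for numerical stability
    exps = {label: math.exp(-(loss - lowest)) for label, loss in losses.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def build_ontology(loss_table, keep=1):
    """For each low-level label, keep only the `keep` highest-weighted edges."""
    edges = {}
    for low_label, losses in loss_table.items():
        weights = softmax_weights(losses)
        ranked = sorted(weights, key=weights.get, reverse=True)
        edges[low_label] = {high: weights[high] for high in ranked[:keep]}
    return edges

# Made-up losses that a language model might assign to strings pairing a
# low-level label (object) with a high-level label (place).
loss_table = {
    "sand": {"beach": 1.2, "kitchen": 4.0, "road": 3.1},
    "sink": {"beach": 4.5, "kitchen": 0.9, "road": 4.2},
}

ontology = build_ontology(loss_table, keep=1)
for low_label, kept in ontology.items():
    print(low_label, "->", kept)
```

A threshold on the weights would be an equally plausible pruning rule; the fixed `keep` count is used here only because it keeps the example short.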