Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces

Chenyangguang Zhang, Alexandros Delitzas, Fangjinhua Wang, Ruida Zhang, Xiangyang Ji, Marc Pollefeys, Francis Engelmann

TL;DR

This work addresses open-vocabulary functional reasoning in 3D indoor scenes by predicting a Functional 3D Scene Graph $\mathcal{G}=(\mathcal{O},\mathcal{I},\mathcal{R})$ that includes objects $\mathcal{O}$, interactive elements $\mathcal{I}$, and functional relations $\mathcal{R}$ between them. The proposed OpenFunGraph pipeline leverages foundation models (VLMs and LLMs) to detect nodes, generate descriptive language, and perform sequential reasoning to infer both local and remote functional edges, enabling rich interaction modeling. To support evaluation, the authors introduce FunGraph3D, a high-fidelity real-world dataset with laser scans, multi-sensor video, egocentric data, and ground-truth functional graphs, along with annotated SceneFun3D extensions. Experimental results show significant improvements over adapted baselines (e.g., Open3DSG, ConceptGraph) on node and triplet recall, and the method enables practical downstream tasks such as 3D inventory question answering and robotic manipulation, highlighting the practical impact for robotics and scene understanding in real environments. The work demonstrates that integrating open-vocabulary perception with language-based reasoning can yield robust, extensible functional representations that support complex reasoning and task execution in real-world spaces.
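As a rough illustration of the structure $\mathcal{G}=(\mathcal{O},\mathcal{I},\mathcal{R})$, the sketch below stores objects, interactive elements, and functional relations as plain Python containers. It is not the OpenFunGraph implementation: the class names, the field names, the edge direction (interactive element to object), and the example scene are all hypothetical. Reading "local" as an element attached to its object and "remote" as an element located away from it is likewise an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A graph node: either an object or an interactive element."""
    node_id: str
    description: str  # free-form, open-vocabulary label


@dataclass(frozen=True)
class FunctionalRelation:
    """A directed edge from an interactive element to an object (assumed direction)."""
    element_id: str
    object_id: str
    relation: str  # free-form text such as "turns on"


@dataclass
class FunctionalSceneGraph:
    objects: dict = field(default_factory=dict)               # O
    interactive_elements: dict = field(default_factory=dict)  # I
    relations: list = field(default_factory=list)             # R

    def add_object(self, node_id, description):
        self.objects[node_id] = Node(node_id, description)

    def add_interactive_element(self, node_id, description):
        self.interactive_elements[node_id] = Node(node_id, description)

    def relate(self, element_id, object_id, relation):
        # Reject edges whose endpoints are not registered nodes.
        if element_id not in self.interactive_elements:
            raise KeyError(f"unknown interactive element: {element_id}")
        if object_id not in self.objects:
            raise KeyError(f"unknown object: {object_id}")
        self.relations.append(FunctionalRelation(element_id, object_id, relation))

    def triplets(self):
        """Return (element description, relation, object description) tuples."""
        return [
            (
                self.interactive_elements[r.element_id].description,
                r.relation,
                self.objects[r.object_id].description,
            )
            for r in self.relations
        ]


def build_example_graph():
    """Build a tiny hypothetical scene with one local and one remote edge."""
    g = FunctionalSceneGraph()
    g.add_object("o1", "cabinet")
    g.add_object("o2", "ceiling light")
    g.add_interactive_element("i1", "handle")
    g.add_interactive_element("i2", "light switch")
    g.relate("i1", "o1", "opens")     # assumed "local": handle sits on the cabinet
    g.relate("i2", "o2", "turns on")  # assumed "remote": switch is away from the light
    return g
```

Keeping descriptions and relations as free-form strings mirrors the open-vocabulary setting, where neither node labels nor edge labels come from a fixed class list.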

Abstract

We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. Unlike traditional 3D scene graphs that focus on spatial relationships of objects, functional 3D scene graphs capture objects, interactive elements, and their functional relationships. Due to the lack of training data, we leverage foundation models, including visual language models (VLMs) and large language models (LLMs), to encode functional knowledge. We evaluate our approach on an extended SceneFun3D dataset and a newly collected dataset, FunGraph3D, both annotated with functional 3D scene graphs. Our method significantly outperforms adapted baselines, including Open3DSG and ConceptGraph, demonstrating its effectiveness in modeling complex scene functionalities. We also demonstrate downstream applications such as 3D question answering and robotic manipulation using functional 3D scene graphs. See our project page at https://openfungraph.github.io
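The TL;DR reports results in terms of node and triplet recall. This section does not describe the exact evaluation protocol, so the following is only a simplified sketch of the idea: it assumes triplets are (element, relation, object) string tuples and counts a ground-truth triplet as recalled only on an exact match. The function name and the matching rule are assumptions, not the paper's metric.

```python
def triplet_recall(predicted, ground_truth):
    """Fraction of ground-truth triplets found among the predictions.

    Simplifying assumption: a triplet counts as recalled only if the
    (element, relation, object) tuple matches exactly.
    """
    gt = set(ground_truth)
    if not gt:
        return 0.0  # convention chosen here for an empty ground truth
    return len(gt & set(predicted)) / len(gt)


# Hypothetical example data, not taken from any dataset.
ground_truth = [
    ("handle", "opens", "cabinet"),
    ("light switch", "turns on", "ceiling light"),
]
predicted = [
    ("handle", "opens", "cabinet"),
    ("knob", "opens", "drawer"),
]
recall = triplet_recall(predicted, ground_truth)
```

Here one of the two ground-truth triplets is recovered, so `recall` is 0.5. An open-vocabulary evaluation would likely need a softer match than string equality, which this sketch does not attempt.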

Paper Structure

This paper contains 33 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Functional 3D Scene Graphs. Given an input sequence of posed RGB-D frames of an indoor environment, our method predicts a functional 3D scene graph by detecting objects, identifying interactive elements, and inferring functional relationships. This enables the representation of interactions, functions, and scene dynamics, going beyond existing 3D scene graph methods that are constrained to spatial relationships between static objects.
  • Figure 2: Illustration of the OpenFunGraph architecture. Given a sequence of posed RGB-D frames $\{(\mathcal{I}_i, \mathcal{D}_i)\}_{i=1}^{n}$, we use RAM++ [zhang2024recognize] and GroundingDINO [liu2023grounding] to detect and segment objects $\mathcal{O}$ and interactive elements $\mathcal{I}$, forming the node candidates of the functional 3D scene graph. Next, a mechanism using the large language model (LLM) GPT [achiam2023gpt] and the visual language model (VLM) LLaVA [liu2024llava] generates natural language descriptions $\mathcal{L}$ for each node. Finally, we infer functional relationships $\mathcal{R}$ between objects $\mathcal{O}$ and interactive elements $\mathcal{I}$, represented as the edges in the functional 3D scene graph $\mathcal{G}$.
  • Figure 3: Modalities of our FunGraph3D dataset. Top: 3D scans from a Faro laser scanner, annotated with 3D object and interactive element masks. Middle: Ground truth functional 3D scene graphs. Bottom: Egocentric video capturing human-scene interactions.
  • Figure 4: Example scenes from our FunGraph3D dataset. The dataset includes typical indoor environments such as living rooms, bedrooms, bathrooms, and kitchens.
  • Figure 5: Qualitative results. Top: input images. Bottom: predicted functional 3D scene graph. Best seen zoomed in on a color screen.
  • ...and 1 more figure