ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

Qiao Gu, Alihusein Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, Chuang Gan, Celso Miguel de Melo, Joshua B. Tenenbaum, Antonio Torralba, Florian Shkurti, Liam Paull

TL;DR

ConceptGraphs introduces an open-vocabulary, object-centric 3D scene graph that fuses 2D foundation-model outputs into a scalable 3D map. By combining object-based mapping, minimum spanning tree (MST)-guided edge reasoning, and LVLM/LLM-driven captioning and planning, it enables language-guided perception and planning for a range of downstream robotics tasks. In Replica and real-robot experiments, it demonstrates open-vocabulary object grounding, complex visual-language queries, manipulation, navigation, and map updating, with competitive accuracy and better scalability than dense per-point methods. The framework lays groundwork for dynamic, relational scene understanding in robotics, but it is limited by captioning reliability and the cost of large-model inference, which motivates future work on temporal dynamics and efficiency.

Abstract

For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, which do not scale well in larger environments, nor do they contain semantic spatial relationships between entities in the environment, which are useful for downstream planning. In this work, we propose ConceptGraphs, an open-vocabulary graph-structured representation for 3D scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association. The resulting representations generalize to novel semantic classes, without the need to collect large 3D datasets or finetune models. We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer video: https://youtu.be/mRhNkQwRYnc )

Paper Structure (25 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: ConceptGraphs builds open-vocabulary 3D scene graphs. We (a) design an object-based mapping system that only assumes class-agnostic instance masks and fuses them to 3D, (b) interpret and extract language tags for each mapped instance by leveraging large vision-language models, and (c) build a graph of object spatial relationships by leveraging priors encoded in large language models. The object-centric nature of ConceptGraphs allows easy map maintenance and promotes scalability, and the graph structure provides relational information within the scene. Furthermore, our scene graph representations are easily mapped to natural language formats to interface with LLMs, enabling them to answer complex scene queries and granting robots access to useful facts about surrounding objects, such as traversability and utility. We implement and demonstrate ConceptGraphs on a number of real-world robotics tasks across wheeled and legged mobile robot platforms. (https://concept-graphs.github.io/) (https://youtu.be/mRhNkQwRYnc)
  • Figure 2: ConceptGraphs builds an open-vocabulary 3D scene graph from a sequence of posed RGB-D images. We use generic instance segmentation models to segment regions from RGB images, extract semantic feature vectors for each, and project them to a 3D point cloud. These regions are incrementally associated and fused from multiple views, resulting in a set of 3D objects and associated vision (and language) descriptors. Then large vision and language models are used to caption each mapped 3D object and derive inter-object relations, which generate the edges that connect the set of objects and form a graph. The resulting 3D scene graph provides a structured and comprehensive understanding of the scene and can further be easily translated to a text description, useful for LLM-based task planning.
  • Figure 3: A Jackal robot answering user queries using the ConceptGraphs representation of a lab environment. We first query an LLM to identify the most relevant object given the user query, then validate with an LVLM whether the target object is at the expected location. If not, we query the LLM again to find a likely location or container for the missing object. (Blue) When prompted with "something to wear for a space party", the Jackal attempts to find a grey shirt with a NASA logo. After failing to detect the shirt at the expected location, the LLM reasons that it could likely be in the laundry bag. (Red) The Jackal searches for red and white sneakers after receiving the user query "footwear for a Ronald McDonald outfit". The LLM redirects the robot to a shoe rack after failing to detect the sneakers where they initially appeared on the map.
  • Figure 4: The Jackal robot solving a traversability challenge. All paths to the goal are obstructed by objects. We query an LLM to identify which objects can be safely pushed or traversed by the robot (green) and which objects would be too heavy or hinder the robot's movement (red). The LLM relies on the ConceptGraphs node captions to make traversability predictions and we add the non-traversable objects to the Jackal costmap for path planning. The Jackal successfully reaches the goal by going through a curtain and pushing a basketball, while also avoiding contact with bricks, an iron dumbbell, and a flower pot.
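
The incremental association and fusion step described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the caption only states that segmented regions are associated and fused across views. The `MapObject` class, the centroid-distance geometric score, the running-average feature update, and the `radius` and `threshold` parameters are all assumptions introduced here for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MapObject:
    """Hypothetical map node: fused 3D points plus one semantic feature vector."""
    points: list      # list of (x, y, z) tuples
    feature: list     # semantic feature vector (e.g. from a vision-language model)
    n_views: int = 1  # number of detections fused into this object


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(points):
    """Mean of a non-empty list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def similarity(det, obj, radius):
    """Assumed score: geometric closeness in [0, 1] plus semantic cosine similarity."""
    d = math.dist(centroid(det.points), centroid(obj.points))
    geometric = max(0.0, 1.0 - d / radius)
    return geometric + cosine(det.feature, obj.feature)


def fuse(obj, det):
    """Merge a detection into an object: running-average feature, pooled points."""
    n = obj.n_views
    obj.feature = [(n * f + g) / (n + 1) for f, g in zip(obj.feature, det.feature)]
    obj.points = obj.points + det.points
    obj.n_views = n + 1


def integrate(objects, detections, radius=0.5, threshold=1.0):
    """Greedily match each detection to its best-scoring object, or start a new one.

    Simplification: objects created earlier in the same frame are also
    candidates for later detections from that frame.
    """
    for det in detections:
        best, best_score = None, threshold
        for obj in objects:
            score = similarity(det, obj, radius)
            if score > best_score:
                best, best_score = obj, score
        if best is None:
            objects.append(det)
        else:
            fuse(best, det)
    return objects


# Two frames: the second re-observes the first object from a nearby viewpoint.
objects = []
integrate(objects, [MapObject([(0.0, 0.0, 0.0)], [1.0, 0.0]),
                    MapObject([(5.0, 0.0, 0.0)], [0.0, 1.0])])
integrate(objects, [MapObject([(0.1, 0.0, 0.0)], [1.0, 0.0])])
print(len(objects), objects[0].n_views)  # prints "2 2"
```

Because the map stores one fused feature per object rather than one per point, its size in this sketch grows with the number of objects and not with the number of observed points, which is the scalability argument the captions make for an object-centric representation.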