TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances

Wenting Xu, Viorela Ila, Luping Zhou, Craig T. Jin

TL;DR

The paper tackles context-dependent object affordances in 3D indoor scenes by introducing TB-HSU, a transformer-based method that constructs a three-layer 3D Hierarchical Scene Graph (3DHSG) consisting of Objects, Regions, and Rooms. It jointly learns room classification and region affordances through a multi-task objective, and introduces the 3DHSG dataset to capture region- and object-specific affordances built upon 3RScan/3DSSG foundations. Across multiple benchmarks, TB-HSU outperforms diverse baselines, and the 3DHSG representation enhances open-vocabulary reasoning in GPT-4o for tasks like locating non-visible objects. The work demonstrates that incorporating spatial context into affordances improves 3D scene understanding and can bolster LLM-driven reasoning for task planning and navigation.

Abstract

The concept of function and affordance is a critical aspect of 3D scene understanding and supports task-oriented objectives. In this work, we develop a model that learns to structure and vary functional affordance across a 3D hierarchical scene graph representing the spatial organization of a scene. The varying functional affordance is designed to integrate with the varying spatial context of the graph. More specifically, we develop an algorithm that learns to construct a 3D hierarchical scene graph (3DHSG) that captures the spatial organization of the scene. Starting from segmented object point clouds and object semantic labels, we develop a 3DHSG with a top node that identifies the room label, child nodes that define local spatial regions inside the room with region-specific affordances, and grand-child nodes indicating object locations and object-specific affordances. To support this work, we create a custom 3DHSG dataset that provides ground truth data for local spatial regions with region-specific affordances and also object-specific affordances for each object. We employ a transformer-based model to learn the 3DHSG. We use a multi-task learning framework that learns both to classify the room and to define spatial regions within the room with region-specific affordances. Our work improves on the performance of state-of-the-art baseline models and shows one approach for applying transformer models to 3D scene understanding and the generation of 3DHSGs that capture the spatial organization of a room. The code and dataset are publicly available.
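
The three-layer hierarchy described in the abstract (room, then regions, then objects) can be pictured as a small tree. The sketch below is only an illustration of that structure, not the authors' implementation or the 3DHSG dataset schema: the class names `RoomNode`, `RegionNode`, and `ObjectNode`, the `centroid` field standing in for object location, the `find_objects` helper, and every label and affordance value in `build_example` are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectNode:
    """Grand-child layer: one segmented object (hypothetical fields)."""
    label: str                             # object semantic label
    centroid: Tuple[float, float, float]   # assumed stand-in for object location
    affordances: List[str] = field(default_factory=list)  # object-specific


@dataclass
class RegionNode:
    """Child layer: a local spatial region with region-specific affordances."""
    affordances: List[str]
    objects: List[ObjectNode] = field(default_factory=list)


@dataclass
class RoomNode:
    """Top layer: the room, identified by its label."""
    label: str
    regions: List[RegionNode] = field(default_factory=list)

    def find_objects(self, affordance: str) -> List[str]:
        """Labels of objects whose enclosing region offers `affordance`."""
        return [
            obj.label
            for region in self.regions
            if affordance in region.affordances
            for obj in region.objects
        ]


def build_example() -> RoomNode:
    """Build a toy graph; all values here are made up for illustration."""
    desk = ObjectNode("desk", (1.0, 0.5, 0.4), ["placing items on"])
    chair = ObjectNode("chair", (1.0, 1.1, 0.3), ["sitting"])
    shelf = ObjectNode("shelf", (3.2, 0.2, 0.9), ["storing"])
    work_region = RegionNode(["working"], [desk, chair])
    storage_region = RegionNode(["storing"], [shelf])
    return RoomNode("office", [work_region, storage_region])
```

Calling `build_example().find_objects("working")` walks the region layer first and only then descends to objects, which mirrors the idea that an object's usable affordance depends on the spatial context it sits in.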

Paper Structure

This paper contains 25 sections, 5 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview: The affordances of objects vary depending on the granularity of the context. We propose a multi-task method, Transformer Based Hierarchical Scene Understanding (TB-HSU), employing a transformer-based model for 3D scene understanding to classify the room and identify the region, forming a 3D Hierarchical Scene Graph (3DHSG) with three layers: Objects, Regions, and Rooms.
  • Figure 2: TB-HSU Model Overview: The model automatically constructs the 3DHSG for a room by completing room and region classifications, taking pairs of instance-segmented point clouds and object semantic labels as inputs. The semantic embedding is derived from object labels, while the position embedding is derived from object points.
  • Figure 3: The 3DHSG from TB-HSU assists GPT-4o in a question-answering task to find an object not visible within the scene. Fig. (a), Fig. (b), and Fig. (c) are inserted at the appropriate places within the prompts.
  • Figure 4: Different room categories in 3DHSG dataset
  • Figure 5: Distribution of region-specific affordances in the 3DHSG dataset
  • ...and 4 more figures