Functionality understanding and segmentation in 3D scenes

Jaime Corsetti, Francesco Giuliari, Alice Fasoli, Davide Boscaini, Fabio Poiesi

TL;DR

Fun3DU tackles the problem of functionality understanding in 3D scenes by proposing a training-free pipeline that combines Chain-of-Thought reasoning in a frozen LLM with open-vocabulary 2D segmentation and 2D-to-3D grounding via vision-language models. The method processes a scene's point cloud, multiple views, and a natural language task to identify contextual and functional objects, selects informative views through a visibility-based ranking, and fuses 2D masks into a consistent 3D segmentation via multi-view agreement. Evaluated on SceneFun3D, Fun3DU outperforms prior open-vocabulary 3D segmentation baselines by substantial margins, achieving AP$_{25}$ above 0.4 and mIoU above 0.2, while maintaining high recall and improved precision. The work demonstrates the practicality of zero-shot functionality understanding, highlighting the importance of reasoning over purely visual cues and offering a scalable approach that leverages pre-trained models without fine-tuning.

Abstract

Understanding functionalities in 3D scenes involves interpreting natural language descriptions to locate functional interactive objects, such as handles and buttons, in a 3D environment. Functionality understanding is highly challenging, as it requires both world knowledge to interpret language and spatial perception to identify fine-grained objects. For example, given a task like 'turn on the ceiling light', an embodied AI agent must infer that it needs to locate the light switch, even though the switch is not explicitly mentioned in the task description. To date, no dedicated methods have been developed for this problem. In this paper, we introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning in order to identify the object of interest. The identified object is segmented across multiple views of the captured scene by using a vision and language model. The segmentation results from each view are lifted in 3D and aggregated into the point cloud using geometric information. Fun3DU is training-free, relying entirely on pre-trained models. We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task, which comprises over 3000 task descriptions on 230 scenes. Our method significantly outperforms state-of-the-art open-vocabulary 3D segmentation approaches. Project page: https://tev-fbk.github.io/fun3du/
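
The lifting and aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes each view provides a 2D mask (a set of pixels) and a pixel-to-point correspondence table. The function name `lift_masks_by_agreement`, the data layout, and the visibility-normalised voting rule with its `min_agreement` threshold are hypothetical choices made for illustration, and are not necessarily the rule used by Fun3DU.

```python
from collections import Counter


def lift_masks_by_agreement(view_masks, pixel_to_point, min_agreement=0.5):
    """Fuse per-view 2D masks into a set of 3D point indices.

    view_masks:     dict view_id -> set of (row, col) pixels inside the 2D mask
    pixel_to_point: dict view_id -> dict (row, col) -> index of the 3D point
                    that projects onto that pixel (assumed precomputed)
    min_agreement:  hypothetical threshold on the fraction of views that
                    must agree for a point to be kept
    """
    votes = Counter()    # number of views whose mask covers the point
    visible = Counter()  # number of views in which the point is visible

    for view_id, correspondences in pixel_to_point.items():
        mask = view_masks.get(view_id, set())

        # Count each point at most once per view for visibility.
        for point in set(correspondences.values()):
            visible[point] += 1

        # Count each point at most once per view for mask membership.
        masked_points = {
            correspondences[pixel] for pixel in mask if pixel in correspondences
        }
        for point in masked_points:
            votes[point] += 1

    # Keep points that enough of the views observing them agree on.
    return {
        point
        for point, count in votes.items()
        if count / visible[point] >= min_agreement
    }


if __name__ == "__main__":
    masks = {"a": {(0, 0), (0, 1)}, "b": {(1, 1)}, "c": {(2, 2)}}
    table = {
        "a": {(0, 0): 7, (0, 1): 8},
        "b": {(1, 1): 7, (1, 0): 8},
        "c": {(2, 2): 9, (2, 0): 7},
    }
    print(sorted(lift_masks_by_agreement(masks, table, min_agreement=0.6)))
```

Normalising by visibility rather than by the total number of views keeps points that are seen in only a few views from being penalised; other normalisations are equally plausible.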

Paper Structure

This paper contains 14 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: We present Fun3DU, the first method for functionality understanding and segmentation in 3D scenes. Fun3DU interprets natural language descriptions (left-hand side) in order to segment functional objects in real-world 3D environments (right-hand side). Fun3DU relies on world knowledge and vision perception capabilities of pre-trained vision and language models, without requiring task-specific finetuning.
  • Figure 2: Fun3DU consists of four main modules. The first module (green) interprets the natural language task description using Chain-of-Thought reasoning with a frozen LLM, identifying the contextual object (pink) and functional object (azure) to segment. The second module (pink) segments the contextual object in input 2D images using an open-vocabulary segmentor, followed by a score-based view selection to discard views where the contextual object is occluded or absent (e.g., the leftmost image). In the third module (azure), the functional objects are segmented in the selected views by a VLM paired with a promptable segmentor. The fourth module (purple) lifts the 2D masks in 3D via 2D-3D correspondences, performs multi-view agreement, and outputs the 3D segmentation masks of the functional objects.
  • Figure 3: Example of LLM reasoning on a task description (in red). First, we pass a system message to condition the LLM, using the possible actions defined by SceneFun3D. Then, we ask it to respond with a JSON structure that includes a "task_solving_sequence" field to perform Chain-of-Thought reasoning.
  • Figure 4: Given the example masks in the first column, the second and third columns show, respectively, the distance distribution $\texttt{P}_{d_\texttt{O}}$ and the angle distribution $\texttt{P}_{\alpha_\texttt{O}}$. The coordinates are normalized, so that $d_\texttt{O} \in [0,\sqrt{2}]$.
  • Figure 5: Qualitative examples of Fun3DU and its baselines on split0 of SceneFun3D [delitzas2024scenefun3d]. Point clouds are cropped around the functional object for better visualization. We report mask-level Precision (Prc), Recall (Rec) and IoU.
  • ...and 1 more figure
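
The distance and angle distributions mentioned in the Figure 4 caption can be illustrated with a small sketch. It assumes that the statistics are measured from the image centre and that pixel coordinates are rescaled to $[-1, 1]$ on each axis, which reproduces the stated range $d_\texttt{O} \in [0,\sqrt{2}]$. The reference point, the function names `polar_statistics` and `histogram`, and the binning are assumptions made for illustration, not details confirmed by the caption.

```python
import math


def polar_statistics(mask_pixels, height, width):
    """Return (distances, angles) of mask pixels relative to the image centre.

    Pixel centres are rescaled to [-1, 1] on both axes (an assumption), so
    every distance lies in [0, sqrt(2)] and every angle in [-pi, pi].
    """
    distances, angles = [], []
    for row, col in mask_pixels:
        x = 2.0 * (col + 0.5) / width - 1.0
        y = 2.0 * (row + 0.5) / height - 1.0
        distances.append(math.hypot(x, y))
        angles.append(math.atan2(y, x))
    return distances, angles


def histogram(values, lo, hi, bins):
    """Normalised histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    for value in values:
        index = int((value - lo) / (hi - lo) * bins)
        counts[max(0, min(index, bins - 1))] += 1  # clamp to a valid bin
    total = sum(counts)
    return [count / total for count in counts] if total else counts


if __name__ == "__main__":
    # Hypothetical mask: a small patch in the top-left corner of an 8x8 image.
    mask = [(r, c) for r in range(2) for c in range(2)]
    d, a = polar_statistics(mask, height=8, width=8)
    print(histogram(d, 0.0, math.sqrt(2.0), bins=4))
    print(histogram(a, -math.pi, math.pi, bins=4))
```

A mask concentrated in one corner yields distributions peaked in a few bins, while a mask spread around the centre yields flatter ones; how such distributions are turned into a view score is left out of this sketch.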