Functionality understanding and segmentation in 3D scenes
Jaime Corsetti, Francesco Giuliari, Alice Fasoli, Davide Boscaini, Fabio Poiesi
TL;DR
Fun3DU tackles the problem of functionality understanding in 3D scenes by proposing a training-free pipeline that combines Chain-of-Thought reasoning in a frozen LLM with open-vocabulary 2D segmentation and 2D-to-3D grounding via vision-language models. The method processes a scene's point cloud, multiple views, and a natural language task to identify contextual and functional objects, selects informative views through a visibility-based ranking, and fuses 2D masks into a consistent 3D segmentation via multi-view agreement. Evaluated on SceneFun3D, Fun3DU outperforms prior open-vocabulary 3D segmentation baselines by substantial margins, achieving AP$_{25}$ above 0.4 and mIoU above 0.2, while maintaining high recall and improved precision. The work demonstrates the practicality of zero-shot functionality understanding, highlighting the importance of reasoning over purely visual cues and offering a scalable approach that leverages pre-trained models without fine-tuning.
Abstract
Understanding functionalities in 3D scenes involves interpreting natural language descriptions to locate functional, interactive objects, such as handles and buttons, in a 3D environment. Functionality understanding is highly challenging, as it requires both world knowledge to interpret language and spatial perception to identify fine-grained objects. For example, given a task like 'turn on the ceiling light', an embodied AI agent must infer that it needs to locate the light switch, even though the switch is not explicitly mentioned in the task description. To date, no dedicated methods have been developed for this problem. In this paper, we introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning in order to identify the object of interest. The identified object is segmented across multiple views of the captured scene using a vision-language model. The segmentation results from each view are lifted to 3D and aggregated into the point cloud using geometric information. Fun3DU is training-free, relying entirely on pre-trained models. We evaluate Fun3DU on SceneFun3D, the most recent and, to date, the only dataset that benchmarks this task, which comprises over 3000 task descriptions on 230 scenes. Our method significantly outperforms state-of-the-art open-vocabulary 3D segmentation approaches. Project page: https://tev-fbk.github.io/fun3du/
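The lifting and aggregation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera per view, a world-to-camera pose, and a per-view depth map used as a simple visibility test, and it fuses masks by plain vote counting. The function name `lift_masks_to_points`, the dictionary keys, and the parameters `depth_tol` and `min_agreement` are hypothetical names chosen for illustration.

```python
import numpy as np

def lift_masks_to_points(points, views, depth_tol=0.05, min_agreement=0.5):
    """Fuse per-view 2D masks into a per-point 3D score by voting.

    points: (N, 3) array of world coordinates.
    views:  list of dicts with hypothetical keys
            'K' (3x3 intrinsics), 'R' (3x3) and 't' (3,) world-to-camera pose,
            'depth' (H, W) depth map, 'mask' (H, W) boolean 2D segmentation.
    Returns (score, selected): the fraction of views that saw each point
    and marked it as masked, and the thresholded boolean selection.
    """
    n = points.shape[0]
    votes = np.zeros(n)  # number of views where the point falls inside the mask
    seen = np.zeros(n)   # number of views where the point is visible

    for view in views:
        cam = points @ view["R"].T + view["t"]  # world -> camera frame
        z = cam[:, 2]
        in_front = z > 1e-6
        safe_z = np.where(in_front, z, 1.0)     # avoid dividing by z <= 0

        # Pinhole projection to integer pixel coordinates.
        uvw = cam @ view["K"].T
        u = np.round(uvw[:, 0] / safe_z).astype(int)
        v = np.round(uvw[:, 1] / safe_z).astype(int)

        h, w = view["depth"].shape
        inside = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        idx = np.flatnonzero(inside)

        # Visibility test: the point's depth must match the depth map.
        visible = np.abs(view["depth"][v[idx], u[idx]] - z[idx]) < depth_tol
        idx = idx[visible]

        seen[idx] += 1
        votes[idx] += view["mask"][v[idx], u[idx]]

    # Points never observed keep a score of zero and are never selected.
    score = np.divide(votes, seen, out=np.zeros(n), where=seen > 0)
    selected = (seen > 0) & (score >= min_agreement)
    return score, selected
```

Normalizing by the number of views in which a point is visible, rather than by the total number of views, keeps rarely observed points from being penalized. This is one possible design choice for the sketch and may differ from the agreement criterion used in the paper.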
