Language-guided 3D scene synthesis for fine-grained functionality understanding

Jaime Corsetti, Francesco Giuliari, Davide Boscaini, Pedro Hermosilla, Andrea Pilzer, Guofeng Mei, Alexandros Delitzas, Francis Engelmann, Fabio Poiesi

TL;DR

SynthFun3D introduces a training-free, task-driven 3D scene synthesis pipeline that generates functional-scene data from natural language prompts. By combining LLM parsing, dual-asset retrieval, metadata-guided mask extraction, and hard-constrained DFS-based layout optimization, it produces scenes with precise part-level masks for interactive elements. The generated data can replace real data with minor performance loss, or supplement it for improved performance, on functionality understanding tasks evaluated on SceneFun3D, which makes it a scalable data source for downstream perception models. Photorealistic augmentation with Cosmos-Transfer 2.5 increases visual realism but still produces hallucinations, pointing future work toward broader asset coverage and physically grounded constraints.
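To make the "hard-constrained DFS-based layout" idea concrete, the following is a minimal sketch of depth-first search with backtracking over object placements. It is not the authors' implementation: the discrete grid, the single `near` relation, the Chebyshev-distance threshold, and the names `place_objects`, `consistent`, and `dfs` are all assumptions made for illustration. The only property it shares with the description above is that spatial relationships are treated as hard constraints, so a placement that violates one is rejected and the search backtracks.

```python
from itertools import product


def place_objects(objects, grid_w, grid_h, near, max_dist=1):
    """Place objects on a grid with depth-first search and backtracking.

    objects: list of object names, placed in the given order.
    near: set of (a, b) name pairs that must end up within max_dist cells
          of each other (Chebyshev distance). These are hard constraints.
    Returns a dict {name: (x, y)}, or None if no valid layout exists.
    All of this is a hypothetical simplification of a real room layout.
    """
    cells = list(product(range(grid_w), range(grid_h)))

    def consistent(name, cell, layout):
        # Hard constraint 1: at most one object per cell.
        if cell in layout.values():
            return False
        # Hard constraint 2: every "near" pair involving this object and an
        # already placed object must be within max_dist cells.
        for a, b in near:
            other = b if a == name else a if b == name else None
            if other in layout:
                ox, oy = layout[other]
                if max(abs(cell[0] - ox), abs(cell[1] - oy)) > max_dist:
                    return False
        return True

    def dfs(i, layout):
        if i == len(objects):
            return dict(layout)  # every object has a valid cell
        name = objects[i]
        for cell in cells:
            if consistent(name, cell, layout):
                layout[name] = cell
                result = dfs(i + 1, layout)
                if result is not None:
                    return result
                del layout[name]  # undo and try the next cell
        return None  # no cell works for this object: backtrack further

    return dfs(0, {})


if __name__ == "__main__":
    # Toy version of "the cabinet near the bed" in a 4x4 room.
    print(place_objects(["bed", "cabinet", "lamp"], 4, 4, {("cabinet", "bed")}))
```

Because constraints are checked before recursing, an infeasible prompt returns `None` instead of a layout that silently violates a relationship, which is the practical difference between hard and soft constraints in this kind of search.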

Abstract

Functionality understanding in 3D, which aims to identify the functional element in a 3D scene to complete an action (e.g., the correct handle to "Open the second drawer of the cabinet near the bed"), is hindered by the scarcity of real-world data due to the substantial effort needed for its collection and annotation. To address this, we introduce SynthFun3D, the first method for task-based 3D scene synthesis. Given the action description, SynthFun3D generates a 3D indoor environment using a furniture asset database with part-level annotation, ensuring the action can be accomplished. It reasons about the action to automatically identify and retrieve the 3D mask of the correct functional element, enabling the inexpensive and large-scale generation of high-quality annotated data. We validate SynthFun3D through user studies, which demonstrate improved scene-prompt coherence compared to other approaches. Our quantitative results further show that the generated data can either replace real data with minor performance loss or supplement real data for improved performance, thereby providing an inexpensive and scalable solution for data-hungry 3D applications. Project page: github.com/tev-fbk/synthfun3d.

Paper Structure

This paper contains 37 sections, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Interfaces used for the three user studies described in the supplementary user-study section. We report the interface used for the scene-prompt coherence study (a), for mask retrieval (b) and for object retrieval (c).
  • Figure 2: SynthFun3D consists of three main stages, illustrated by the numbered blocks. Firstly, the input prompt is parsed by an LLM to extract a structured description of the room layout and the objects of interest. Secondly, the target object is retrieved from a large-scale dataset annotated with masks and semantic labels. The retrieval pipeline considers both the similarity of the asset with a provided description (1) and the arrangement and quantity of functional elements on it (2-3). Finally, the retrieved objects are arranged within the generated room according to the spatial relationships explicitly defined in the input prompt, via a Depth-First-Search algorithm.
  • Figure 2: User study on the correctness of the observed room layout with respect to the prompt. We compare the default Holodeck, Holodeck with hard constraints, and our method.
  • Figure 3: Left: A synthetic RGB image and its corresponding multi-instance segmentation mask generated with Blender. Right: Photorealistic variants produced by our prompt-driven style transfer pipeline based on Cosmos-Transfer 2.5. The generated images show spatial and semantic consistency with the input segmentation while exhibiting substantial diversity in style, materials, and illumination.
  • Figure 3: Qualitative examples of 3D rooms generated with SynthFun3D. Each example shows a top-down view of the generated room, the corresponding functional prompt (white box, top), the relationships between the objects mentioned in the prompt (green lines), and rendered data from a random viewpoint within the room (magenta close-up, bottom).
  • ...and 13 more figures