PARSE: Part-Aware Relational Spatial Modeling

Yinuo Bai, Peijun Xu, Kuixiang Shao, Yuyang Jiao, Jingxuan Zhang, Kaixin Yao, Jiayuan Gu, Jingyi Yu

TL;DR

PARSE is a framework that explicitly models how object parts interact to determine feasible, spatially grounded scene configurations. It significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.

Abstract

Inter-object relations underpin spatial intelligence, yet existing representations -- linguistic prepositions or object-level scene graphs -- are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatially grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.
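
To make the contrast between object-level and part-level relations concrete, here is a minimal sketch of a part-level graph. It is not the paper's PAG implementation: the class names (`PartRef`, `PartRelation`, `PartGraph`), the part names, and the `supported_by` label are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class PartRef:
    """One named part of one object instance, e.g. ("bookcase", "shelf_2")."""
    obj: str
    part: str


@dataclass(frozen=True)
class PartRelation:
    """Directed edge between two parts; the relation labels are illustrative."""
    src: PartRef
    relation: str
    dst: PartRef


@dataclass
class PartGraph:
    """Toy part-level graph: objects own parts, edges connect parts."""
    parts: Dict[str, Set[str]] = field(default_factory=dict)
    relations: List[PartRelation] = field(default_factory=list)

    def add_object(self, obj: str, parts: List[str]) -> None:
        self.parts[obj] = set(parts)

    def relate(self, src: PartRef, relation: str, dst: PartRef) -> None:
        # Reject edges that mention a part no object declares.
        for ref in (src, dst):
            if ref.part not in self.parts.get(ref.obj, set()):
                raise KeyError(f"unknown part: {ref.obj}.{ref.part}")
        self.relations.append(PartRelation(src, relation, dst))

    def object_level_edges(self) -> Set[Tuple[str, str, str]]:
        """Collapse to an object-level scene graph, discarding part identity."""
        return {(r.src.obj, r.relation, r.dst.obj) for r in self.relations}


# Hypothetical scene: two books resting on different shelves of one bookcase.
g = PartGraph()
g.add_object("bookcase", ["shelf_1", "shelf_2", "side_panel"])
g.add_object("book_a", ["bottom_face"])
g.add_object("book_b", ["bottom_face"])
g.relate(PartRef("book_a", "bottom_face"), "supported_by", PartRef("bookcase", "shelf_1"))
g.relate(PartRef("book_b", "bottom_face"), "supported_by", PartRef("bookcase", "shelf_2"))

# Part-level edges distinguish the two shelves; object-level edges do not.
part_targets = {r.dst.part for r in g.relations}
object_targets = {dst for _, _, dst in g.object_level_edges()}
```

In this toy scene the collapsed graph only records that both books are supported by the bookcase, while the part-level edges keep the shelf each book rests on. This is the kind of ambiguity the abstract attributes to object-level scene graphs.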

Paper Structure (20 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview of the capabilities of PARSE. Leveraging PARSE together with physics simulation, we can construct Part-centric Assembly Graphs (PAGs) (b) that closely match the spatial organization of objects in a real image (a), and use them to generate physically plausible 3D scenes (c) with diverse object relationships and rich part-level contacts. In addition, we build PARSE-10K, a collection of high-quality 3D indoor scenes with fully part-segmented object instances. PARSE-10K effectively supports downstream tasks such as fine-tuning VLMs for spatial reasoning and 3D scene generation.
  • Figure 2: An illustrative example of our Part-centric Assembly Graph (PAG). The left panel shows a PARSE-generated scene, where several regions governed by spatial relation constraints are highlighted. The middle panel presents the corresponding PAG: object nodes are visualized as full circular nodes, while part nodes are shown as partially clipped nodes attached to their parent object; the sub-PAGs corresponding to the highlighted regions are also emphasized. The right panel zooms into one such sub-PAG, where we annotate, for each part node, the oriented face used in defining its relational constraint.
  • Figure 3: Controllable Scene Synthesis via Part-Aware Spatial Configuration Solver. (a) Coarse Localization: The solver first prunes the 2D support surface of occupied regions (red), then further contracts the feasible space using object-level spatial relations (orange). (b) Part-Level Alignment: Precise geometric alignment is achieved by enforcing constraints (e.g., coplanarity) between specific surfaces identified by the solver. This drastically shrinks the pose space from which a final pose is sampled. (c) Fine-Grained Relational Control: Specifying different part-level geometric relations in the PAG results in distinct and predictable arrangements, showcasing the framework's fine-grained controllability.
  • Figure 4: Gallery of PARSE-10K
  • Figure 5: Visualization of model-predicted graphs. Green boxes indicate objects correctly matched by both label and grounding; red boxes indicate failed matches; gray boxes denote missed detections. Green arrows denote relations judged correct under the grounding-agnostic metric; red arrows denote incorrect relations.
  • ...and 1 more figure
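
The two solver stages described in the Figure 3 caption can be illustrated with a small 2D sketch. It is a simplified stand-in rather than the paper's solver, under these assumptions: footprints are axis-aligned rectangles on a single support surface, poses are sampled on a fixed grid, the object-level relation is a hypothetical `left_of`, and the part-level constraint makes the new object's back face coplanar with the anchor's back face. All function names and dimensions are made up for illustration.

```python
import random
from dataclasses import dataclass

EPS = 1e-9  # tolerance for floating-point comparisons


@dataclass(frozen=True)
class Rect:
    """Axis-aligned footprint on the support surface (x0 <= x1, y0 <= y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Rect") -> bool:
        # Touching edges do not count as a collision.
        return (self.x0 < other.x1 - EPS and other.x0 < self.x1 - EPS
                and self.y0 < other.y1 - EPS and other.y0 < self.y1 - EPS)

    def inside(self, outer: "Rect") -> bool:
        return (self.x0 >= outer.x0 - EPS and self.x1 <= outer.x1 + EPS
                and self.y0 >= outer.y0 - EPS and self.y1 <= outer.y1 + EPS)


def footprint(cx, cy, w, d):
    """Footprint of a w-by-d object centred at (cx, cy)."""
    return Rect(cx - w / 2, cy - d / 2, cx + w / 2, cy + d / 2)


def is_free(fp, surface, occupied):
    return fp.inside(surface) and not any(fp.overlaps(o) for o in occupied)


def coarse_candidates(surface, occupied, w, d, step=0.05):
    """Stage (a), step 1: grid centres that fit on the surface and avoid occupied regions."""
    nx = int((surface.x1 - surface.x0 - w) / step + EPS)
    ny = int((surface.y1 - surface.y0 - d) / step + EPS)
    centres = [(surface.x0 + w / 2 + i * step, surface.y0 + d / 2 + j * step)
               for i in range(nx + 1) for j in range(ny + 1)]
    return [c for c in centres if is_free(footprint(*c, w, d), surface, occupied)]


def left_of(cands, anchor, w, d):
    """Stage (a), step 2: an object-level relation contracts the feasible set."""
    return [c for c in cands if footprint(*c, w, d).x1 <= anchor.x0 + EPS]


def align_back_faces(cands, anchor, w, d, surface, occupied):
    """Stage (b): make the new object's back face (y1) coplanar with the anchor's."""
    cy = anchor.y1 - d / 2  # the constraint removes the y degree of freedom
    xs = sorted({cx for cx, _ in cands})
    return [(cx, cy) for cx in xs if is_free(footprint(cx, cy, w, d), surface, occupied)]


# Hypothetical desk (1.0 x 0.6) with one anchor object already placed at the back.
surface = Rect(0.0, 0.0, 1.0, 0.6)
anchor = Rect(0.5, 0.3, 0.9, 0.6)
w, d = 0.2, 0.1

coarse = coarse_candidates(surface, [anchor], w, d)
relational = left_of(coarse, anchor, w, d)
aligned = align_back_faces(relational, anchor, w, d, surface, [anchor])
pose = random.Random(0).choice(aligned)  # final pose sampled from the shrunken set
```

Each stage strictly reduces the candidate set in this example, mirroring the caption's point that part-level constraints shrink the pose space before a final pose is sampled. A real solver would also need to handle rotation, height, and non-rectangular geometry.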