
FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement

Ian Huang, Yanan Bao, Karen Truong, Howard Zhou, Cordelia Schmid, Leonidas Guibas, Alireza Fathi

TL;DR

FirePlace is a training-free framework that enables multimodal large language models (MLLMs) to perform 3D object placement in complex scenes by grounding language instructions in fine-grained geometric constraints and delegating geometric computation to external 3D reasoning tools. It comprises a constraint-outline generation stage, a multi-stage 3D reasoning pipeline with surface extraction and a constraint solver, and a plausibility pruning stage, with Batched Visual Selection used to scale visual grounding. Empirical results on 50 USD scenes (266 tasks) show that FirePlace outperforms Holodeck and LayoutGPT in geometric fidelity, plausibility, and visibility, and human evaluations corroborate the physical feasibility and common-sense alignment of its placements. The work highlights the potential of combining MLLMs with explicit geometry for 3D scene construction, and outlines limitations and future directions, such as reducing latency and broadening constraint coverage.

Abstract

Scene generation with 3D assets presents a complex challenge, requiring both high-level semantic understanding and low-level geometric reasoning. While Multimodal Large Language Models (MLLMs) excel at semantic tasks, their application to 3D scene generation is hindered by their limited grounding in 3D geometry. In this paper, we investigate how to best work with MLLMs in an object placement task. Towards this goal, we introduce a novel framework, FirePlace, that applies existing MLLMs in (1) 3D geometric reasoning and the extraction of relevant geometric details from the 3D scene, (2) constructing and solving geometric constraints on the extracted low-level geometry, and (3) pruning for final placements that conform to common sense. By combining geometric reasoning with the real-world understanding of MLLMs, our method can propose object placements that satisfy both geometric constraints and high-level semantic common-sense considerations. Our experiments show that these capabilities allow our method to place objects more effectively in complex scenes with intricate geometry, surpassing the quality of prior work.
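
The three steps named in the abstract (extract geometry, solve constraints on it, prune the results) can be illustrated with a toy sketch. This is not the FirePlace implementation: the `Rect` surface type, the `solve_contact` sampler, and the `prune` helper are hypothetical names introduced here, the only constraint modeled is "the object rests fully on a flat anchor surface", and the plausibility test is a plain Python callable standing in for the MLLM judgment described in the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Rect:
    """Hypothetical extracted surface: an axis-aligned horizontal rectangle at height z."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    z: float


# (x, y, z) of the center of the object's bottom face.
Placement = Tuple[float, float, float]


def solve_contact(anchor_top: Rect, width: float, depth: float,
                  n_samples: int = 100, seed: int = 0) -> List[Placement]:
    """Sample placements whose rectangular footprint lies fully on the anchor surface."""
    half_w, half_d = width / 2, depth / 2
    lo_x, hi_x = anchor_top.x_min + half_w, anchor_top.x_max - half_w
    lo_y, hi_y = anchor_top.y_min + half_d, anchor_top.y_max - half_d
    if lo_x > hi_x or lo_y > hi_y:
        return []  # the footprint is larger than the surface: no feasible placement
    rng = random.Random(seed)
    return [(rng.uniform(lo_x, hi_x), rng.uniform(lo_y, hi_y), anchor_top.z)
            for _ in range(n_samples)]


def prune(candidates: List[Placement],
          is_plausible: Callable[[Placement], bool]) -> List[Placement]:
    """Keep only candidates accepted by a plausibility judge (an MLLM in the paper)."""
    return [p for p in candidates if is_plausible(p)]


# Usage: place a 0.3 x 0.2 object on a 2 x 1 table top at height 0.75.
table_top = Rect(0.0, 2.0, 0.0, 1.0, z=0.75)
candidates = solve_contact(table_top, width=0.3, depth=0.2, n_samples=50)
# Assumed stand-in for the common-sense judgment: prefer the middle of the table.
kept = prune(candidates, lambda p: abs(p[0] - 1.0) < 0.5)
```

Separating the solver from the judge mirrors the division of labor the abstract describes: geometry decides what is feasible, and a semantic judgment decides which feasible placements are sensible.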

Paper Structure

This paper contains 29 sections, 29 figures, 6 tables, 2 algorithms.

Figures (29)

  • Figure 1: FirePlace pipeline. [Stage 1] FirePlace first generates a set of constraint outlines, describing in text the applicable constraints and the corresponding interacting surfaces. [Stages 2-4] FirePlace then selects the anchor object using Batched Visual Selection on instance segmentation masks. It extracts the surfaces that best match the constraint outline, and then uses a constraint solver to produce feasible layouts. [Stage 5] Finally, it uses an MLLM to select a subset of placements that adhere to common sense principles.
  • Figure 2: Qualitative samples of object placements (shown in red masks) within 3D scenes based on language instructions. FirePlace can place diverse objects in a variety of settings, and produce geometrically feasible and semantically plausible object placements.
  • Figure 3: Comparisons against Holodeck. Holodeck fails to put the collection of books onto the shelf (due to its bounding box representation), and produces many implausible placements due to incorrect selection of anchor objects using the caption-based selection method.
  • Figure 4: Comparisons against LayoutGPT. LayoutGPT produces implausible object placements with intersections, showing that LLMs often fail to accurately estimate object positions and should be guided by constraints, as done in FirePlace.
  • Figure 5: Common failure modes. On the left, the placed object overlaps with preexisting objects, because the constraint library does not include a constraint to minimize intersections. In the middle, the placement of the chair was not constrained beyond contact with the ground, but additional constraints should have been generated (such as parallelism between the backs of the masked chair and the adjacent chair). On the right, the plausibility pruning step failed to remove an implausible placement that arose from under-constrained placement (the bottom of the books is in contact with the table, but the books overhang its edge), leading to a result in which the books float over the edge of the table.
  • ...and 24 more figures
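
Two of the failure modes in the Figure 5 caption (overlap with existing objects, and an object overhanging its support) correspond to simple geometric tests. The sketch below shows what such tests could look like under strong simplifying assumptions. It is not part of FirePlace, whose constraint library lacks an intersection constraint according to the caption. The `Box` type and both functions are hypothetical, and axis-aligned bounding boxes are a coarse proxy: the Figure 3 caption attributes a Holodeck failure to exactly this kind of bounding-box representation.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box given by its min and max corners."""
    min_xyz: Vec3
    max_xyz: Vec3


def intersects(a: Box, b: Box, eps: float = 1e-9) -> bool:
    """True if the boxes overlap with positive volume; touching faces do not count."""
    return all(a.min_xyz[i] < b.max_xyz[i] - eps and
               b.min_xyz[i] < a.max_xyz[i] - eps
               for i in range(3))


def supported_fraction(obj: Box, support: Box) -> float:
    """Fraction of obj's x-y footprint that lies over the support's x-y footprint."""
    overlap_area = 1.0
    for i in range(2):  # x and y axes only
        lo = max(obj.min_xyz[i], support.min_xyz[i])
        hi = min(obj.max_xyz[i], support.max_xyz[i])
        overlap_area *= max(0.0, hi - lo)
    footprint = ((obj.max_xyz[0] - obj.min_xyz[0]) *
                 (obj.max_xyz[1] - obj.min_xyz[1]))
    return overlap_area / footprint if footprint > 0 else 0.0


# Usage: a book resting on a table top, with half of its footprint past the edge.
table = Box((0.0, 0.0, 0.0), (2.0, 1.0, 0.75))
book = Box((1.8, 0.2, 0.75), (2.2, 0.5, 0.8))
touching_only = not intersects(book, table)   # contact without interpenetration
fraction = supported_fraction(book, table)    # about half of the footprint is supported
```

A placement like this one satisfies a contact constraint yet is only about half supported, which is the under-constrained situation the caption describes for the overhanging books.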