Lay-A-Scene: Personalized 3D Object Arrangement Using Text-to-Image Priors

Ohad Rahamim, Hilit Segev, Idan Achituve, Yuval Atzmon, Yoni Kasten, Gal Chechik

TL;DR

Lay-A-Scene tackles Open-set-3D-Arrange, the task of arranging previously unseen 3D objects, by exploiting the priors of a pre-trained text-to-image diffusion model. It first personalizes the diffusion model with renderings of the given objects and generates a scene image that respects a textual description. It then infers the 3D pose of each object by matching 2D features between the scene image and renders of the object, and by solving a constrained PnP optimization (SI-PnP) that enforces physical coherence. Key contributions include formalizing Open-set-3D-Arrange, introducing Side-Information PnP with ground-plane and collision penalties, and an iterative object-merging strategy for scenes with more objects, all evaluated on Objaverse objects with human raters. The approach shows how 2D diffusion priors can be distilled into 3D layouts without retraining the foundation model, for open-world object sets.

Abstract

Generating 3D visual scenes is at the forefront of visual generative AI, but current 3D generation techniques struggle with generating scenes with multiple high-resolution objects. Here we introduce Lay-A-Scene, which solves the task of Open-set 3D Object Arrangement, effectively arranging unseen objects. Given a set of 3D objects, the task is to find a plausible arrangement of these objects in a scene. We address this task by leveraging pre-trained text-to-image models. We personalize the model and explain how to generate images of a scene that contains multiple predefined objects without neglecting any of them. Then, we describe how to infer the 3D poses and arrangement of objects from a 2D generated image by finding a consistent projection of objects onto the 2D scene. We evaluate the quality of Lay-A-Scene using 3D objects from Objaverse and human raters and find that it often generates coherent and feasible 3D object arrangements.
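
To make the idea of a pose search with side information more concrete, here is a minimal sketch of a cost that combines reprojection error with a ground-plane term, plus a separate footprint-overlap term standing in for a collision penalty. It is an illustration under stated assumptions, not the paper's SI-PnP formulation: the function names, the yaw-only rotation, the pinhole parameters, the penalty weight, and the use of matched keypoints (rather than the full mesh) for ground contact are all hypothetical choices.

```python
import math

def transform(point, yaw, scale, t):
    """Scale, rotate about the vertical (y) axis, then translate a 3D point."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    xr, zr = c * x + s * z, -s * x + c * z
    return (scale * xr + t[0], scale * y + t[1], scale * zr + t[2])

def project(point, focal=500.0, cx=256.0, cy=256.0):
    """Pinhole projection; camera at the origin looking along +z, y pointing up."""
    x, y, z = point
    return (focal * x / z + cx, cy - focal * y / z)

def si_pnp_style_cost(points_3d, points_2d, yaw, scale, t,
                      ground_y=-1.0, weight=10.0):
    """Mean squared reprojection error plus a ground-contact penalty.

    ground_y and weight are assumed values for this toy example.
    """
    moved = [transform(p, yaw, scale, t) for p in points_3d]
    reproj = sum(
        (u - pu) ** 2 + (v - pv) ** 2
        for (u, v), (pu, pv) in zip(map(project, moved), points_2d)
    ) / len(moved)
    # Simplification: the lowest matched keypoint should rest on the ground.
    lowest = min(p[1] for p in moved)
    return reproj + weight * (lowest - ground_y) ** 2

def footprint_overlap(a, b):
    """Overlap area of two ground footprints given as (xmin, xmax, zmin, zmax)."""
    dx = min(a[1], b[1]) - max(a[0], b[0])
    dz = min(a[3], b[3]) - max(a[2], b[2])
    return max(dx, 0.0) * max(dz, 0.0)

# Toy data: corners of a unit cube whose base sits at y = 0 in object space.
cube = [(x, y, z) for x in (-0.5, 0.5) for y in (0.0, 1.0) for z in (-0.5, 0.5)]
true_yaw, true_scale, true_t = 0.3, 0.8, (0.2, -1.0, 5.0)
observed = [project(transform(p, true_yaw, true_scale, true_t)) for p in cube]

cost_true = si_pnp_style_cost(cube, observed, true_yaw, true_scale, true_t)
cost_floating = si_pnp_style_cost(cube, observed, true_yaw, true_scale,
                                  (0.2, -0.7, 5.0))
```

In this toy setup the cost vanishes at the pose that generated the 2D points and grows when the object is lifted off the ground. A real pipeline would minimize such a cost with a numerical optimizer and add the overlap term across object pairs.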

Paper Structure

This paper contains 37 sections, 7 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Lay-A-Scene is a test-time optimization method for finding plausible layouts of given 3D objects, leveraging a pre-trained text-to-image diffusion model. Given as input several mesh objects (left in each panel) and a textual description of the scene ("a bedroom" or "garden"), the text-to-image model generates a scene image with these objects (on the bottom-right). This image is used to find an arrangement of the 3D input objects (at the center of each panel).
  • Figure 2: Lay-A-Scene consists of two phases. In the first phase, the given objects are used to personalize a text-to-image model, and a scene image is generated. In the second phase, we find a transformation $Tr_i$ for each 3D object $i$ that matches the 2D arrangement in the generated scene image. $Tr_i$ is found using our SI-PnP, by matching the DIFT representations of the objects and the scene image.
  • Figure 3: Establishing correspondences between keypoints of a 3D object and a scene image by extracting DIFT features from the fine-tuned SD model. The process begins with rendering 2D images of a given object, which are used to map between features of the 3D object and the generated scene image. On the right, each connecting line indicates a matched feature pair.
  • Figure 4: Qualitative examples of 2-object layouts generated by Lay-A-Scene.
  • Figure 5: Iterative approach results: Lay-A-Scene layout for multi-object scenes. We show scenes of a living room and a bedroom, each with multiple objects. We present scenes with 4 to 7 objects produced with the iterative approach.
  • ...and 8 more figures
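
The correspondence step described in the Figure 3 caption can be illustrated with a small feature-matching sketch. The captions only say that feature pairs are matched; the mutual-nearest-neighbor rule, the cosine similarity, the `min_sim` threshold, and the toy 3-dimensional vectors below are assumptions made for illustration, standing in for real DIFT features.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mutual_matches(render_feats, scene_feats, min_sim=0.8):
    """Return (render_index, scene_index) pairs that are each other's best match."""
    def best(query, pool):
        return max(range(len(pool)), key=lambda j: cosine(query, pool[j]))

    matches = []
    for i, feat in enumerate(render_feats):
        j = best(feat, scene_feats)
        # Keep the pair only if the match is mutual and similar enough.
        if best(scene_feats[j], render_feats) == i and cosine(feat, scene_feats[j]) >= min_sim:
            matches.append((i, j))
    return matches

# Toy features: one vector per keypoint in the object render and the scene image.
render_feats = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
scene_feats = [(0.0, 0.9, 0.1), (0.95, 0.05, 0.0), (0.5, 0.5, 0.5)]

matches = mutual_matches(render_feats, scene_feats)
```

Each surviving pair links a rendered keypoint, whose 3D position on the mesh is known, to a pixel in the scene image, which is the kind of 2D-3D correspondence a PnP solver takes as input.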