Lay-A-Scene: Personalized 3D Object Arrangement Using Text-to-Image Priors
Ohad Rahamim, Hilit Segev, Idan Achituve, Yuval Atzmon, Yoni Kasten, Gal Chechik
TL;DR
Lay-A-Scene tackles Open-set 3D Object Arrangement (Open-set-3D-Arrange): arranging a set of previously unseen 3D objects into a plausible scene by exploiting the priors of a pre-trained text-to-image diffusion model. It first personalizes the diffusion model with renderings of the given objects and generates a scene image that respects a textual description. It then infers each object's 3D pose by matching 2D features in the generated image to 3D renders and solving a constrained PnP optimization, Side-Information PnP (SI-PnP), that enforces physical coherence. Key contributions are the formalization of Open-set-3D-Arrange, SI-PnP with ground-plane and collision penalties, and an iterative object-merging strategy for scenes with more objects, evaluated on Objaverse objects with human raters. The approach shows how 2D diffusion priors can be distilled into 3D layouts for open-world object sets without retraining the foundation model beyond personalization.
Abstract
Generating 3D visual scenes is at the forefront of visual generative AI, but current 3D generation techniques struggle to produce scenes containing multiple high-resolution objects. Here we introduce Lay-A-Scene, which solves the task of Open-set 3D Object Arrangement, effectively arranging unseen objects. Given a set of 3D objects, the task is to find a plausible arrangement of these objects in a scene. We address this task by leveraging pre-trained text-to-image models. We personalize the model and explain how to generate images of a scene that contains multiple predefined objects without neglecting any of them. Then, we describe how to infer the 3D poses and arrangement of objects from a 2D generated image by finding a consistent projection of objects onto the 2D scene. We evaluate the quality of Lay-A-Scene using 3D objects from Objaverse and human raters and find that it often generates coherent and feasible 3D object arrangements.
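To make the idea of "finding a consistent projection" with a ground-plane side constraint more concrete, here is a minimal sketch of the kind of cost such a pose search might minimize. It is an illustration under stated assumptions, not the paper's SI-PnP formulation: the pinhole camera with a y-up axis, the yaw-only rotation, the squared ground penalty, and all names (`transform`, `project`, `pose_cost`, `ground_weight`) are hypothetical choices made for this example.

```python
import math

def transform(point, yaw, translation, scale=1.0):
    """Rotate about the vertical (y) axis, scale, then translate.

    Assumption: objects only rotate about the up axis, and the camera
    frame is aligned with the scene (y up, z pointing forward).
    """
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    rx, ry, rz = c * x + s * z, y, -s * x + c * z
    tx, ty, tz = translation
    return (scale * rx + tx, scale * ry + ty, scale * rz + tz)

def project(point, focal=1.0):
    """Pinhole projection of a camera-frame point with z > 0."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

def pose_cost(points_3d, points_2d, yaw, translation, scale=1.0,
              ground_y=0.0, ground_weight=1.0, focal=1.0):
    """Mean squared reprojection error plus a ground-plane penalty.

    The penalty is the squared gap between the object's lowest
    transformed point and the plane y = ground_y (a hypothetical
    stand-in for a ground-plane side-information term).
    """
    reproj = 0.0
    lowest = math.inf
    for p, q in zip(points_3d, points_2d):
        placed = transform(p, yaw, translation, scale)
        u, v = project(placed, focal)
        reproj += (u - q[0]) ** 2 + (v - q[1]) ** 2
        lowest = min(lowest, placed[1])
    ground = (lowest - ground_y) ** 2
    return reproj / len(points_3d) + ground_weight * ground

# Toy object: four points of a box whose base sits at y = 0 in its own frame.
object_points = [(-0.5, 0.0, -0.5), (0.5, 0.0, -0.5),
                 (0.5, 1.0, 0.5), (-0.5, 1.0, 0.5)]

# Synthetic "matches": project the object at a known pose resting on y = -1.
true_yaw, true_t, ground = 0.3, (0.2, -1.0, 4.0), -1.0
image_points = [project(transform(p, true_yaw, true_t)) for p in object_points]

cost_true = pose_cost(object_points, image_points, true_yaw, true_t,
                      ground_y=ground)
# A pose that floats the object above the ground plane.
cost_floating = pose_cost(object_points, image_points, true_yaw,
                          (0.2, -0.5, 4.0), ground_y=ground)
```

In this toy setup the true pose has zero cost, while the floating pose is penalized both by reprojection error and by the ground term. A collision penalty between objects, which the TL;DR also mentions, would be a further additive term and is left out of this sketch.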
