SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, Vasileios Balntas

TL;DR

SceneScript presents a novel autoregressive, text-based representation of 3D indoor scenes by predicting a sequence of structured language commands from egocentric video. By encoding geometry with point clouds, lifted image features, or end-to-end encoded posed views and decoding into a flexible command language, it achieves state-of-the-art layout estimation and competitive 3D object detection on Aria Synthetic Environments (ASE). The approach is highly extensible, enabling new tasks such as primitive-based object reconstruction and curved-wall modeling with minimal changes to the network. The large ASE dataset and language-based representation open pathways for interactive editing, querying, and task augmentation through language, aligning 3D reconstruction with recent LLM capabilities.

Abstract

We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our proposed scene representation is inspired by recent successes in transformers & LLMs, and departs from more traditional methods which commonly describe scenes as meshes, voxel grids, point clouds or radiance fields. Our method infers the set of structured language commands directly from encoded visual data using a scene language encoder-decoder architecture. To train SceneScript, we generate and release a large-scale synthetic dataset called Aria Synthetic Environments consisting of 100k high-quality indoor scenes, with photorealistic and ground-truth annotated renders of egocentric scene walkthroughs. Our method gives state-of-the-art results in architectural layout estimation, and competitive results in 3D object detection. Lastly, we explore an advantage of SceneScript, namely its ability to readily adapt to new commands via simple additions to the structured language, which we illustrate for tasks such as coarse 3D object part reconstruction.
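
To make the idea of a structured scene language concrete, here is a minimal Python sketch of how a scene written as text commands could be parsed and flattened into integer tokens for an autoregressive decoder. It is an illustration under stated assumptions, not the authors' implementation: the `name, key=value` line syntax, the simplified parameter lists in `SCHEMA`, the example values, and the 5 cm quantization step are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified schema: command name -> ordered parameter names.
# The real command set and parameters may differ.
SCHEMA = {
    "make_wall": ["a_x", "a_y", "b_x", "b_y", "height"],
    "make_door": ["position_x", "position_y", "width", "height"],
}

# Hypothetical example scene, one command per line.
EXAMPLE = """
make_wall, a_x=0.0, a_y=0.0, b_x=4.0, b_y=0.0, height=2.5
make_door, position_x=2.0, position_y=0.0, width=0.9, height=2.0
"""


@dataclass
class Command:
    name: str
    params: dict = field(default_factory=dict)


def parse_command(line: str) -> Command:
    """Parse one 'name, key=value, ...' line into a Command."""
    parts = [p.strip() for p in line.split(",") if p.strip()]
    params = {}
    for part in parts[1:]:
        key, value = part.split("=", 1)
        params[key.strip()] = float(value)
    return Command(parts[0], params)


def parse_scene(text: str) -> list:
    """Parse a multi-line scene description, skipping blank lines."""
    return [parse_command(line) for line in text.splitlines() if line.strip()]


def to_tokens(cmd: Command, schema: dict, resolution: float = 0.05) -> list:
    """Flatten a command into integers: a command id, then each
    parameter quantized to `resolution` metres, in schema order."""
    tokens = [list(schema).index(cmd.name)]
    for key in schema[cmd.name]:
        tokens.append(round(cmd.params[key] / resolution))
    return tokens


if __name__ == "__main__":
    for cmd in parse_scene(EXAMPLE):
        print(cmd.name, to_tokens(cmd, SCHEMA))
```

In this sketch, supporting a new entity type only requires adding an entry to `SCHEMA`; the parser and tokenizer stay unchanged. This mirrors the extensibility the abstract describes, where new commands are added to the language without restructuring the network.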

Paper Structure (61 sections, 4 equations, 17 figures, 10 tables)

Figures (17)

  • Figure 1: (top) Given an egocentric video of an environment, SceneScript directly predicts a 3D scene representation consisting of structured scene language commands. (bottom) Our method generalizes on diverse real scenes while being solely trained on synthetic indoor environments. (last column, bottom) A notable advantage of our method is its capacity to easily adapt the structured language to represent novel scene entities. For example, by introducing a single new command, SceneScript can directly predict object parts jointly with the layout and bounding boxes.
  • Figure 2: Aria Synthetic Environments: (top) Random samples of generated scenes showing diversity of layouts, lights and object placements. (bottom - left to right) A top down view of a scene filled with objects, a simulated trajectory (blue path), renderings of depth, RGB, and object instances, and lastly a scene pointcloud.
  • Figure 3: SceneScript core pipeline overview. Raw images & pointcloud data are encoded into a latent code, which is then autoregressively decoded into a sequence of commands that describe the scene. Visualizations are shown using a custom-built interpreter. Note that for the results in this paper, the point clouds are computed from the images using Aria MPS [aria_white_paper], i.e., without a dedicated RGB-D / Lidar sensor.
  • Figure 4: Qualitative comparison between our model and SOTA methods on the Aria Synthetic Environments test set. Hierarchical methods like SceneCAD suffer from error cascading, which leads to missing elements in the edge prediction module. RoomFormer (a 2D method extruded to 3D) primarily suffers from lightly captured scene regions, which leave an unnoticeable signal in the density map.
  • Figure 5: Example scene reconstructions on scenes from Aria Synthetic Environments. (left) Visualisation of the decomposed meshes used to create make_prim training pairs. (right) Views of full scene predictions, as well as close-ups highlighting the fidelity of object reconstruction through the prediction of volumetric primitives enabled by make_prim.
  • ...and 12 more figures