Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis

Qi Sun, Hang Zhou, Wengang Zhou, Li Li, Houqiang Li

TL;DR

Forest2Seq recasts indoor scene synthesis as an order-aware sequential problem: it organizes unordered object sets into scene trees and forests and derives a meaningful ordering from them. A decoder-only transformer with a denoising strategy then places furniture autoregressively, guided by a ViT-based layout encoder and an object attribute encoder. The approach achieves state-of-the-art or competitive FID and KL scores on 3D-FRONT, demonstrates practical benefits for scene completion and rearrangement, and validates the importance of an order prior in 3D scene generation. Limitations include neglecting doors and windows and occasional object overlaps; future work aims to learn the ordering jointly with generation and to incorporate additional spatial constraints. Overall, Forest2Seq advances realistic 3D indoor scene synthesis by integrating hierarchical ordering with sequential generation.

Abstract

Synthesizing realistic 3D indoor scenes is a challenging task that traditionally relies on manual arrangement and annotation by expert designers. Recent advances in autoregressive models have automated this process, but they often lack semantic understanding of the relationships and hierarchies present in real-world scenes, yielding limited performance. In this paper, we propose Forest2Seq, a framework that formulates indoor scene synthesis as an order-aware sequential learning problem. Forest2Seq organizes the inherently unordered collection of scene objects into structured, ordered hierarchical scene trees and forests. By employing a clustering-based algorithm and a breadth-first traversal, Forest2Seq derives meaningful orderings and utilizes a transformer to generate realistic 3D scenes autoregressively. Experimental results on standard benchmarks demonstrate Forest2Seq's superiority in synthesizing more realistic scenes compared to top-performing baselines, with significant improvements in FID and KL scores. Our additional experiments for downstream tasks and ablation studies also confirm the importance of incorporating order as a prior in 3D scene generation.
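The ordering step mentioned in the abstract (a forest of scene trees flattened by breadth-first traversal) can be illustrated with a minimal sketch. The `SceneNode` class, the `flatten_bfs` function, and the hand-built bedroom forest are all hypothetical names and data introduced here for illustration; the paper's clustering-based tree construction is not reproduced, so the trees are written by hand.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """Hypothetical tree node holding one object label and its children."""
    label: str
    children: list[SceneNode] = field(default_factory=list)


def flatten_bfs(roots: list[SceneNode]) -> list[str]:
    """Flatten a forest into one sequence by breadth-first traversal.

    All roots are enqueued first, so objects at a shallower depth in any
    tree come before deeper objects in every tree.
    """
    order: list[str] = []
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        order.append(node.label)
        # Children are appended to the back, giving level-by-level order.
        queue.extend(node.children)
    return order


# Hand-built example forest (an assumption, not taken from the paper):
# two trees with dependents, plus a standalone "cabinet" tree.
forest = [
    SceneNode("bed", [SceneNode("nightstand")]),
    SceneNode("desk", [SceneNode("chair")]),
    SceneNode("cabinet"),
]

sequence = flatten_bfs(forest)
print(sequence)
```

Under this sketch, anchor objects such as the bed and desk appear in the sequence before the items that depend on them, which is the kind of ordering an autoregressive model could then consume one object at a time.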

Paper Structure (18 sections, 8 equations, 13 figures, 7 tables, 3 algorithms)

Figures (13)

  • Figure 1: We present Forest2Seq, which mines the implicit hierarchy from the scene (bottom left) and employs the tree-derived ordering as a significant prior to direct sequential indoor scene synthesis (top). The presence of placing-adaptable furniture items (bottom right), exemplified by the cabinets, necessitates the evolution from a single tree to a scene forest representation.
  • Figure 2: Training framework of our Forest2Seq. On the left, we depict the construction of a tree/forest from parsing the scene and its subsequent flattening into a sequence through breadth-first search. The right panel illustrates our use of a causal transformer equipped with a denoising strategy for sequential data learning.
  • Figure 3: An example illustrating the motivation for the scene forest. The whole room is clearly divided into two subscenes according to human activity. However, the "cabinet" is an exception, as it can reasonably belong to any subscene or to the entire scene. Note that some items in the base tree are omitted for simplicity.
  • Figure 4: Qualitative comparison with the state-of-the-art methods ATISS (Paschalidou et al., 2021), COFS, and DiffuScene (Tang et al., 2023) on scene synthesis for three types of scenes: bedrooms (1st row), living rooms (2nd and 3rd rows), and dining rooms (4th row). Note that the reference is the scene from the dataset with the same floor plan.
  • Figure 5: Ablation study: visual comparison of different orderings. While random and fixed orders cannot provide an appropriate prior, our tree-guided order benefits scene synthesis, generating plausible scenes. The forest representation further enhances scene diversity and realism.
  • ...and 8 more figures