Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis
Qi Sun, Hang Zhou, Wengang Zhou, Li Li, Houqiang Li
TL;DR
Forest2Seq recasts indoor scene synthesis as an order-aware sequential problem: it organizes each unordered set of objects into scene trees and forests and derives a meaningful ordering from that hierarchy. A decoder-only transformer trained with a denoising strategy then places furniture autoregressively, guided by a ViT-based layout encoder and an object attribute encoder. The approach achieves state-of-the-art or competitive FID and KL scores on 3D-FRONT, carries over to scene completion and rearrangement, and its ablations support the value of an order prior in 3D scene generation. Limitations include ignoring doors and windows and occasional object overlaps; future work aims to learn the ordering jointly with generation and to incorporate additional spatial constraints.
Abstract
Synthesizing realistic 3D indoor scenes is a challenging task that traditionally relies on manual arrangement and annotation by expert designers. Recent advances in autoregressive models have automated this process, but they often lack semantic understanding of the relationships and hierarchies present in real-world scenes, yielding limited performance. In this paper, we propose Forest2Seq, a framework that formulates indoor scene synthesis as an order-aware sequential learning problem. Forest2Seq organizes the inherently unordered collection of scene objects into structured, ordered hierarchical scene trees and forests. By employing a clustering-based algorithm and a breadth-first traversal, Forest2Seq derives meaningful orderings and utilizes a transformer to generate realistic 3D scenes autoregressively. Experimental results on standard benchmarks demonstrate Forest2Seq's superiority in synthesizing more realistic scenes compared to top-performing baselines, with significant improvements in FID and KL scores. Our additional experiments for downstream tasks and ablation studies also confirm the importance of incorporating order as a prior in 3D scene generation.
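The ordering step described in the abstract (cluster the objects, arrange them into trees, then read them off breadth-first) can be illustrated with a minimal sketch. This is not the paper's algorithm: the distance-threshold clustering, the choice of the largest-footprint object as each tree's root, the single level of children, and all names (`Obj`, `Node`, `build_forest`, `bfs_order`, `radius`) are assumptions made here for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field
from math import dist


@dataclass
class Obj:
    name: str
    xy: tuple[float, float]  # floor-plane centroid (hypothetical units: metres)
    area: float              # footprint area, used here to pick tree roots


@dataclass
class Node:
    obj: Obj
    children: list["Node"] = field(default_factory=list)


def cluster(objs, radius):
    """Single-link clustering: objects within `radius` of any member join the group."""
    clusters, unvisited = [], list(objs)
    while unvisited:
        seed = unvisited.pop(0)
        group, frontier = [seed], deque([seed])
        while frontier:
            cur = frontier.popleft()
            near = [o for o in unvisited if dist(o.xy, cur.xy) < radius]
            for o in near:
                unvisited.remove(o)
                group.append(o)
                frontier.append(o)
        clusters.append(group)
    return clusters


def build_forest(objs, radius=1.5):
    """One tree per cluster; root = largest object (an assumption of this sketch)."""
    forest = []
    for group in cluster(objs, radius):
        root_obj = max(group, key=lambda o: o.area)
        rest = [o for o in group if o is not root_obj]
        rest.sort(key=lambda o: dist(o.xy, root_obj.xy))  # nearest children first
        forest.append(Node(root_obj, [Node(o) for o in rest]))
    forest.sort(key=lambda n: n.obj.area, reverse=True)    # largest roots first
    return forest


def bfs_order(forest):
    """Breadth-first traversal over the forest: all roots, then their children."""
    queue, order = deque(forest), []
    while queue:
        node = queue.popleft()
        order.append(node.obj.name)
        queue.extend(node.children)
    return order


scene = [
    Obj("nightstand", (0.2, 1.0), 0.25),
    Obj("bed", (1.0, 1.0), 4.0),
    Obj("chair", (4.0, 3.5), 0.3),
    Obj("desk", (4.0, 3.0), 1.2),
    Obj("lamp", (1.9, 1.1), 0.1),
]
order = bfs_order(build_forest(scene))
print(order)  # ['bed', 'desk', 'nightstand', 'lamp', 'chair']
```

In this toy bedroom the two anchor pieces (bed, desk) come first and their dependents follow, which is the kind of sequence an autoregressive generator would then be trained to emit token by token.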
