CLIP-Layout: Style-Consistent Indoor Scene Synthesis with Semantic Furniture Embedding

Jingyu Liu, Wenhan Xiong, Ian Jones, Yixin Nie, Anchit Gupta, Barlas Oğuz

TL;DR

CLIP-Layout tackles indoor scene synthesis by moving beyond category-level placement to instance-level, style-aware generation. It combines floor-plan encoding, a permutation-invariant transformer, and CLIP-based object embeddings derived from multi-view renders to produce coherent, style-consistent furniture layouts and enable zero-shot text-guided editing. On 3D-FRONT, it achieves state-of-the-art auto-completion and partial-synthesis metrics and demonstrates versatile text-driven synthesis and furniture replacement without retraining. The approach broadens applicability to unseen furniture and supports downstream tasks such as immersive environment generation and embodied agent training. Limitations include data scarcity and some failure modes, inviting larger datasets and integration of explicit design priors in future work.

Abstract

Indoor scene synthesis involves automatically picking and placing furniture appropriately on a floor plan, so that the scene looks realistic and is functionally plausible. Such scenes can serve as homes for immersive 3D experiences, or be used to train embodied agents. Existing methods for this task rely on labeled categories of furniture, e.g., bed, chair, or table, to generate contextually relevant combinations of furniture. Whether heuristic or learned, these methods ignore instance-level visual attributes of objects, and as a result may produce visually less coherent scenes. In this paper, we introduce an auto-regressive scene model that can output instance-level predictions, using general-purpose image embeddings based on CLIP. This allows us to learn visual correspondences such as matching color and style, and produce more functionally plausible and aesthetically pleasing scenes. Evaluated on the 3D-FRONT dataset, our model achieves SOTA results in scene synthesis and improves auto-completion metrics by over 50%. Moreover, our embedding-based approach enables zero-shot text-guided scene synthesis and editing, which easily generalizes to furniture not seen during training.

Paper Structure

This paper contains 26 sections, 2 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: Automatic scene synthesis: With a floor plan and an optional style description prompt as conditions, our model can synthesize rooms with style-consistency, diversity, and visual realism.
  • Figure 2: Object embedding: CLIP-Layout calculates the semantic embedding of a 3D mesh by feeding images rendered from eight canonical directions to the CLIP image encoder.
  • Figure 3: Partial scene completion: From left to right, the columns show the partial scenes, the scenes completed by ATISS, and the scenes generated by our model. CLIP-Layout can distinguish not only color information, but also furniture material and shape.
  • Figure 4: Scene synthesis from room masks: Visual examples of scenes generated from floor plans. The top row contains samples from ATISS and the bottom row samples from CLIP-Layout.
  • Figure 5: Furniture replacement using text prompts: Each pair consists of the original scene on the left and the one with the replaced furniture on the right, with the text prompt below it.
  • ...and 6 more figures
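
The object embedding in Figure 2 and the text-prompted replacement in Figure 5 can be illustrated with a small sketch. It assumes that the eight per-view CLIP embeddings are combined by mean pooling followed by L2 normalization, and that replacement picks the catalog item with the highest cosine similarity to the prompt embedding; the captions above confirm neither detail. The function names (`pool_view_embeddings`, `retrieve_closest`) and the toy 3-dimensional vectors are hypothetical stand-ins for real CLIP outputs.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def pool_view_embeddings(view_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Combine per-view embeddings into one object embedding.

    ASSUMPTION: mean pooling over unit-normalized views; the source only
    says that renders from eight directions are fed to the CLIP image encoder.
    """
    if not view_embeddings:
        raise ValueError("need at least one view embedding")
    normalized = [l2_normalize(v) for v in view_embeddings]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def retrieve_closest(query: Sequence[float], catalog: Dict[str, Vector]) -> str:
    """Return the catalog key whose embedding is most similar to the query.

    ASSUMPTION: nearest-neighbor retrieval in a shared text/image space.
    """
    if not catalog:
        raise ValueError("catalog is empty")
    return max(catalog, key=lambda name: cosine_similarity(query, catalog[name]))


# Toy data: eight "views" of one object, standing in for CLIP image embeddings.
views = [[1.0, 0.0, 0.0]] * 4 + [[0.0, 1.0, 0.0]] * 4
object_embedding = pool_view_embeddings(views)

# Toy catalog and a toy "text prompt" embedding (hypothetical values).
catalog = {"sofa_a": [1.0, 0.0, 0.0], "lamp_b": [0.0, 0.0, 1.0]}
prompt_embedding = [0.9, 0.1, 0.0]
best_match = retrieve_closest(prompt_embedding, catalog)
```

Because the catalog is only compared by embedding similarity, a new furniture item can be added by computing its pooled embedding, without retraining anything in this sketch. That is consistent with the zero-shot behavior the abstract describes, though the paper's actual pipeline may differ.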