FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow

Zhifei Yang, Guangyao Zhai, Keyang Lu, YuYang Yin, Chao Zhang, Zhen Xiao, Jieyi Long, Nassir Navab, Yikai Wang

Abstract

Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This design allows fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.
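
For readers unfamiliar with rectified flow, the generic mechanism the abstract refers to can be sketched as follows. This is a minimal toy illustration of standard rectified flow (a straight-line path between a noise sample and a data sample, a constant velocity target, and forward Euler sampling), not FlowScene's implementation. It omits the graph conditioning, the three branches, and the information exchange between objects. All function names and the oracle velocity field are hypothetical stand-ins for a learned network.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field; constant along a straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy usage: a hypothetical 3-D "data" vector and a Gaussian noise start point.
rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]
x0 = [rng.gauss(0.0, 1.0) for _ in x1]


def oracle_velocity(x, t):
    """Assumption: stands in for a trained network that predicts the velocity."""
    return target_velocity(x0, x1)


midpoint = interpolate(x0, x1, 0.5)
sample = euler_sample(x0, oracle_velocity, num_steps=4)
```

Because the oracle velocity is exactly the straight-line displacement, a few Euler steps carry the noise sample to the data sample. A learned velocity field only approximates this, so in practice the number of steps trades speed against accuracy.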

Paper Structure (20 sections, 8 figures, 5 tables)

Figures (8)

  • Figure 6: Qualitative comparison with graph-conditioned generative models. The top row illustrates the input graphs, highlighting key semantic and spatial edges. In the generated scenes, red rectangles indicate generation failures in baselines (e.g., collisions), whereas green rectangles mark consistent regions in our results. * denotes the retrieval mode. $^\dag$ denotes that text-only scene graphs are given.
  • Figure 7: Qualitative comparison on object generation with other methods.
  • Figure 8: Failure case. The left panel shows the input multimodal scene graph, while the right panel shows the generated failure case. Red cross marks indicate removed relationships.
  • Figure 9: Prompt template for evaluating FPVScore [huang2025video] with GPT-4o [achiam2023gpt].
  • Figure 10: Perceptual Study Interface: Instructions and Metrics. This introductory page is presented to participants at the start of the perceptual study. It outlines the study objectives, provides detailed instructions for viewing and rating, and clearly defines the five evaluation metrics (PA, LC, VQ, SC, OP).
  • ...and 3 more figures