SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation

Ling Wang, Hao-Xiang Guo, Xinzhou Wang, Fuchun Sun, Kai Sun, Pengkun Liu, Hang Xiao, Zhong Wang, Guangyuan Fu, Eric Li, Yang Liu, Yikai Wang

TL;DR

SceneTransporter formulates and solves an entropic Optimal Transport objective within the denoising loop of the compositional DiT model, and significantly improves instance-level coherence and geometric fidelity.

Abstract

We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at https://2019epwl.github.io/SceneTransporter/.
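For orientation, the generic entropic OT problem that a formulation like this typically builds on is written out below. This is the textbook form, not necessarily the paper's exact objective. The symbols are assumptions introduced here for illustration: $C \in \mathbb{R}^{M \times N}$ is a cost between $M$ part-level latents and $N$ image patches, $P$ is the transport plan, $a$ and $b$ are the prescribed marginals, and $\varepsilon > 0$ is the entropic regularization strength.

$$
P^\star = \arg\min_{P \in U(a,b)} \; \langle P, C \rangle - \varepsilon H(P),
\qquad
U(a,b) = \left\{ P \in \mathbb{R}_{\ge 0}^{M \times N} : P \mathbf{1}_N = a,\; P^\top \mathbf{1}_M = b \right\},
$$

$$
H(P) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P_{ij} \left( \log P_{ij} - 1 \right).
$$

The minimizer has the form $P^\star = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)$ with $K = \exp(-C/\varepsilon)$ taken elementwise, which is what makes Sinkhorn's alternating scaling of $u$ and $v$ applicable. The marginal constraints are the source of the competition between parts: mass assigned to one part for a given patch is no longer available to the others.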

Paper Structure (41 sections, 18 equations, 11 figures, 4 tables)


Figures (11)

  • Figure 1: Comparison of our end-to-end scene generation pipeline in (c) with compositional 3D latent diffusion and existing "divide and conquer" methods.
  • Figure 2: Qualitative Results on Vecset-based Latent Probing. Cluster and Cluster with CCA are our probes, which operate in the compositional latent space of PartPacker; VAE clusters the latent obtained by encoding the fused geometry produced by PartPacker with the VAE. Colors denote part assignments.
  • Figure 2: User Study. Human evaluation of different structured 3D scene generation methods across multiple aspects. Scores range from 1 to 4, with higher scores indicating better performance. Bold values represent the best performance within each metric.
  • Figure 3: Overview of the SceneTransporter pipeline. At each denoising step $t$, our Optimal-Transport–Guided Correlation Assignment framework formulates a global OT problem between image patches and part-level tokens within the compositional latent DiT. We compute a part-patch cost from Q/K similarity, regularized by image edges, and solve for an optimal transport plan using Sinkhorn iterations. The OT plan gates the cross-attention to enforce an explicit patch-to-part routing, and the resulting gated attention map updates the latent $z_t$. The attention maps are shown across denoising steps, with assignments becoming sharper and more instance-consistent.
  • Figure 4: Qualitative Comparison on Structured 3D Scene Generation across Methods. Different colors indicate different parts in the generated 3D scene.
  • ...and 6 more figures
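
The steps named in the Figure 3 caption (a part-patch cost from Q/K similarity, a Sinkhorn solve, and gating of cross-attention by the resulting plan) can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the cosine-based cost, the uniform marginals, the value of `eps`, and the particular gating rule are all hypothetical choices made here. The edge-based regularizer is omitted because its form is not given in this summary.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Entropic OT via Sinkhorn scaling; returns a plan with marginals close to (a, b)."""
    K = np.exp(-cost / eps)          # Gibbs kernel, shape (M, N)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def ot_gated_attention(q, k, eps=0.5, n_iters=200):
    """q: (M, d) part-token queries, k: (N, d) image-patch keys (hypothetical shapes)."""
    M, d = q.shape
    N = k.shape[0]

    # Assumption: cost = 1 - cosine similarity between part queries and patch keys.
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=1, keepdims=True)
    cost = 1.0 - qn @ kn.T           # (M, N), values in [0, 2]

    # Assumption: uniform mass over parts and over patches.
    a = np.full(M, 1.0 / M)
    b = np.full(N, 1.0 / N)
    plan = sinkhorn(cost, a, b, eps, n_iters)

    # Standard scaled dot-product attention of each part over all patches.
    logits = q @ k.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)

    # Assumption: gate each patch by its plan mass relative to its strongest part,
    # then renormalize so every part's attention row sums to one.
    gate = plan / plan.max(axis=0, keepdims=True)
    gated = attn * gate
    gated /= gated.sum(axis=1, keepdims=True)
    return plan, gated

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 16))         # 3 hypothetical part tokens
k = rng.normal(size=(10, 16))        # 10 hypothetical image patches
plan, gated = ot_gated_attention(q, k)
print(plan.argmax(axis=0))           # part with the most plan mass for each patch
```

A smaller `eps` makes the plan closer to a hard assignment but needs more iterations and, for large costs, a log-domain solver to stay numerically stable. In a denoising loop, a step like this would be recomputed from the current queries and keys at each timestep.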