SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu

TL;DR

SeeThrough3D introduces an occlusion-aware 3D scene representation (OSCR), in which objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from a desired camera viewpoint. The method generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.

Abstract

We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from a desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset of diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
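
The role of transparency described above can be illustrated with a minimal 2D sketch. This is not the authors' renderer: the `composite` helper, the box tuple format, and the opacity value 0.5 are assumptions chosen for illustration, and flat rectangles stand in for rendered 3D boxes. The point it shows is that with opaque boxes the occluded part of the far box leaves no trace in the condition image, whereas with translucent boxes it still contributes to the pixel values.

```python
import numpy as np

def composite(boxes, alpha, size=(8, 8)):
    """Alpha-composite axis-aligned 2D boxes, ordered far to near.

    boxes: list of (x0, y0, x1, y1, rgb) tuples (hypothetical format).
    alpha: opacity shared by all boxes; 1.0 is opaque, lower is translucent.
    """
    h, w = size
    canvas = np.zeros((h, w, 3), dtype=np.float64)
    for x0, y0, x1, y1, rgb in boxes:
        color = np.asarray(rgb, dtype=np.float64)
        region = canvas[y0:y1, x0:x1]
        # Standard "over" blending of the new box onto what is already drawn.
        canvas[y0:y1, x0:x1] = (1.0 - alpha) * region + alpha * color
    return canvas

far_box = (1, 1, 5, 5, (1.0, 0.0, 0.0))   # red box, behind
near_box = (3, 3, 7, 7, (0.0, 0.0, 1.0))  # blue box, in front

opaque = composite([far_box, near_box], alpha=1.0)
translucent = composite([far_box, near_box], alpha=0.5)

# Pixel (row 4, column 4) lies in the overlap of the two boxes.
print(opaque[4, 4])       # only the near (blue) box is visible
print(translucent[4, 4])  # the hidden (red) box still contributes
```

In the overlap region the opaque render carries no red signal at all, while the translucent render keeps a nonzero red channel, which is the cue the abstract refers to as encoding hidden object regions.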

Paper Structure (31 sections, 29 figures, 3 tables)

Figures (29)

  • Figure 1: We propose SeeThrough3D, a method for occlusion-aware 3D scene control in text-to-image generation. Our method (a) enables occlusion-aware 3D object placement in generated images and (b) adheres well to complex layouts featuring many objects. Additionally, our method allows for (c) control over the camera viewpoint in the generated image.
  • Figure 2: OSCR: We propose Occlusion-Aware Scene Representation (OSCR) for 3D layout control in text-to-image generation. OSCR describes objects as translucent 3D boxes, which exposes occluded regions, enabling the generative model to reason about occlusions. Further, each box face is color-coded with a mapping that encodes its 3D orientation. (a) A user specifies the object bounding boxes ($b_0$ and $b_1$) and sets the desired viewpoint $\mathcal{C}$ in an interactive graphics environment. (b) These boxes are rendered to obtain our OSCR representation, (c) which is used to condition the generation for occlusion-aware 3D control.
  • Figure 3: Towards occlusion-aware 3D scene layouts: existing methods represent scenes as (a) 3D layout depth maps [loosecontrol, eldesokey2024build, wang2025cinemaster], which fail to represent occluded objects (see dashed red box), or (b) object layers [zhan2025larender, liang2025vodiff], which are not 3D-aware and hence fail to capture camera viewpoint and perspective. (c) Therefore, we propose OSCR, where objects are described using translucent 3D bounding boxes. The transparency exposes occluded regions (red box), providing cues for occlusion reasoning, while enabling 3D layout control.
  • Figure 4: SeeThrough3D: We encode the rendered OSCR condition map $\mathbf{r}$ using the VAE to obtain OSCR tokens. These are concatenated with text prompt tokens $\mathbf{p}$ and noisy image tokens $\mathbf{x}_t$. The concatenated sequence is passed through the DiT-based text-to-image model, where the tokens are jointly processed using self-attention modules. We inject LoRA [hu2021lora] onto the attention projections corresponding to OSCR tokens; this enables control while preserving the prior of the base model [zhang2025easycontrol, tan2024ominicontrol, tan2025ominicontrol2].
  • Figure 5: (a) Inside the mmDiT block, text tokens $\mathbf{p}$, image tokens $\mathbf{x}_t$ and OSCR tokens $\mathbf{z}$ are jointly processed using self-attention, conditioning the generation on our OSCR representation. To bind objects to their corresponding boxes, we mask the attention so that OSCR tokens within each box $\{ b_i \}$ attend to the corresponding object tokens $\{ \mathbf{p}_i \}$, using a mask $\mathbf{M}$. (b) For this, we require the spatial extent of each object box $b_i$, which we obtain from its amodal segmentation mask $\mathbf{s}_i$. When multiple boxes overlap, their region of intersection (green) attends to multiple objects.
  • ...and 24 more figures
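
The box-to-text binding described in the Figure 5 caption can be sketched as a boolean attention block built from amodal masks. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `oscr_to_text_mask`, the `(start, end)` token-span format, and the choice to leave OSCR tokens outside every box unbound are hypothetical, and only the OSCR-to-text part of $\mathbf{M}$ is shown, since the captions do not specify the remaining blocks.

```python
import numpy as np

def oscr_to_text_mask(amodal_masks, token_spans, num_text_tokens):
    """Boolean block M[j, k]: may OSCR token j attend to text token k?

    amodal_masks: list of (H, W) boolean arrays s_i on the OSCR token grid.
    token_spans: list of (start, end) text-token index ranges, one per object
                 (hypothetical format standing in for the object tokens p_i).
    """
    h, w = amodal_masks[0].shape
    mask = np.zeros((h * w, num_text_tokens), dtype=bool)
    for s_i, (start, end) in zip(amodal_masks, token_spans):
        inside = s_i.reshape(-1)        # OSCR tokens covered by box i
        mask[inside, start:end] = True  # bind them to object i's text tokens
    return mask

# Two overlapping boxes on a 4x4 token grid; they share grid cell (2, 2).
s0 = np.zeros((4, 4), dtype=bool)
s0[0:3, 0:3] = True
s1 = np.zeros((4, 4), dtype=bool)
s1[2:4, 2:4] = True

# Seven text tokens; object 0 owns tokens 1-2, object 1 owns tokens 4-5.
M = oscr_to_text_mask([s0, s1], [(1, 3), (4, 6)], num_text_tokens=7)

overlap_token = 2 * 4 + 2  # flattened index of grid cell (2, 2)
print(M[overlap_token])    # attends to the text tokens of both objects
print(M[0])                # attends to object 0 only
```

Because each object writes into the same boolean array independently, a token in the intersection of two boxes ends up attending to both objects' text tokens, matching the overlap behavior the caption describes.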