Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models

Hyundo Lee, Suhyung Choi, Inwoo Hwang, Byoung-Tak Zhang

TL;DR

This work tackles spatial inconsistencies in diffusion-based image generation by jointly modeling images and intrinsic scene properties (depth, normals, segmentation, line drawings). It introduces Intrinsic Latent Diffusion Models (I-LDM), which co-generate images and intrinsics. An intrinsic VAE encodes multiple intrinsics into a single latent, and cross-domain self-attention with a weight scheduling mechanism aligns the two domains while preserving image fidelity. The approach improves spatial coherence and yields more natural scene layouts across diverse prompts and in hand-structure generation, and it adapts to multiple base models (e.g., SD2.1, SDXL, PixArt-alpha) without sacrificing base-model quality. Because it relies on pre-trained intrinsic estimators and on LoRA weights trained for the intrinsic domain, I-LDM offers a practical path to more realistic and structurally faithful image generation in T2I systems.

Abstract

Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).
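To make the idea of sharing self-attention between the image and intrinsic domains more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the exchange can be modeled as a convex blend between the image tokens' own self-attention and their attention over intrinsic tokens, controlled by a per-step weight. The function names (`drop_weight`, `gaussian_weight`, `cross_domain_attention`), the blend rule, and the parameters `cutoff`, `center`, and `sigma` are all hypothetical; only the schedule names "Drop" and "Gaussian" come from the figure captions, and their exact definitions are not given in this summary.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Plain scaled dot-product attention for 2-D (tokens, dim) arrays.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def drop_weight(step, num_steps, cutoff=0.5):
    # ASSUMPTION: a "drop" schedule exchanges attention fully during the
    # early denoising steps and switches it off after a cutoff fraction.
    return 1.0 if step < cutoff * num_steps else 0.0


def gaussian_weight(step, num_steps, center=0.0, sigma=0.3):
    # ASSUMPTION: a "Gaussian" schedule decays the exchange weight smoothly
    # as a Gaussian of the normalized step index.
    t = step / max(num_steps - 1, 1)
    return float(np.exp(-0.5 * ((t - center) / sigma) ** 2))


def cross_domain_attention(q_img, k_img, v_img, k_int, v_int, w):
    # ASSUMPTION: blend the image domain's own self-attention with attention
    # over intrinsic-domain keys and values. With w = 0 this reduces exactly
    # to the base model's self-attention.
    own = attention(q_img, k_img, v_img)
    cross = attention(q_img, k_int, v_int)
    return (1.0 - w) * own + w * cross
```

Under these assumptions, setting the weight to zero leaves the image branch identical to the base model, which is one simple way to picture how a schedule could limit the exchange to the steps where layout is decided and keep later steps close to the base model's behavior.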

Paper Structure

This paper contains 51 sections, 19 equations, 21 figures, 10 tables, 1 algorithm.

Figures (21)

  • Figure 1: By co-generating images and aligned intrinsic scene properties, we aim to address the problem of spatial inconsistency prevalent in existing text-to-image models. (Gray) Paradoxical image generated by Stable Diffusion 2.1, including inconsistent wall, object (green circle), and floor (orange circle). (Beige) Our approach generates an image and intrinsic scene properties representing the scene from diverse perspectives, thereby producing a more natural and realistic image.
  • Figure 2: The overall architecture of I-LDM. (Left) We first train an intrinsic VAE to encode all intrinsics into a single latent variable. (Middle) Then, we train the LoRA weights of the self-attention layers included in the diffusion network of the intrinsic domain to learn the denoising process. (Right) We employ a weight scheduling mechanism for exchanging self-attention with the image domain. As a result, I-LDM simultaneously generates spatially consistent images and intrinsics during inference.
  • Figure 3: Qualitative analysis of generated images and co-generated intrinsic scene properties from I-LDM (clockwise from top left: depth map, surface normal, line drawing, and segmentation map). The red boxes indicate generated images of the base model. The top and bottom rows visualize samples with Drop and Gaussian weight scheduling, respectively. Red and blue captions denote samples from the Parti and Multi prompts, respectively.
  • Figure 4: Comparison of base and I-LDM with generated intrinsics, providing key cues for reducing spatial inconsistencies.
  • Figure 5: Comparison of the base model and I-LDM with Drop, Gaussian, and no weight scheduling.
  • ...and 16 more figures