PRISM: A Unified Framework for Photorealistic Reconstruction and Intrinsic Scene Modeling

Alara Dirik, Tuanfeng Wang, Duygu Ceylan, Stefanos Zafeiriou, Anna Frühstück

TL;DR

PRISM introduces a unified diffusion-transformer framework that jointly generates RGB images and intrinsic scene maps (X layers), enabling text-to-RGBX, RGB-to-X, and X-to-RGBX tasks while supporting global and local editing through conditioning on selected intrinsic maps. By expanding the latent token space to multiple modalities and training with partial modality availability, PRISM achieves improved cross-modal alignment without sacrificing the base model's text-to-image capabilities. Extensive quantitative and qualitative evaluations demonstrate competitive intrinsic decomposition performance and strong conditional generation, with practical applications in relighting and material editing. The work highlights the benefits of a single, multi-task model for perception and generation, and points to indoor-scene data as a current limitation and avenue for future expansion toward additional modalities and broader domains.

Abstract

We present PRISM, a unified framework that enables multiple image generation and editing tasks in a single foundational model. Starting from a pre-trained text-to-image diffusion model, PRISM proposes an effective fine-tuning strategy to produce RGB images along with intrinsic maps (referred to as X layers) simultaneously. Unlike previous approaches, which infer intrinsic properties individually or require separate models for decomposition and conditional generation, PRISM maintains consistency across modalities by generating all intrinsic layers jointly. It supports diverse tasks, including text-to-RGBX generation, RGB-to-X decomposition, and X-to-RGBX conditional generation. Additionally, PRISM enables both global and local image editing through conditioning on selected intrinsic layers and text prompts. Extensive experiments demonstrate the competitive performance of PRISM both for intrinsic image decomposition and conditional image generation while preserving the base model's text-to-image generation capability.

Paper Structure

This paper contains 15 sections, 4 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Pipeline of our PRISM model. An RGB image and its corresponding X intrinsic channels are encoded into latent space via a fixed VAE encoder. A Diffusion Transformer is applied to the latent tokens of all channels simultaneously and is conditioned on the text embedding of an input text prompt. The denoised tokens are passed through a fixed decoder for RGB+X generation. During training, intrinsic channels are randomly ablated, which makes PRISM a unified framework for text-to-image generation, intrinsic decomposition, and conditional image generation with any subset of intrinsic images.
  • Figure 2: Sample results generated with PRISM. Our model is capable of text-, text+X-, and X-conditioned generation and can inherently perform image decomposition and re-composition under different material, geometry, and lighting conditions. Condition channels are highlighted in orange.
  • Figure 3: Visual comparison of our PRISM model against baseline methods on synthetic datasets. All input images and ground truths are from the HyperSim dataset, except for the classroom scene (b, right).
  • Figure 4: Visual comparison of RGB reconstruction from predicted albedo and irradiance for white balance alignment. We compare the PRISM model against Zeng2024RGBXID and ground-truth reconstructions.
  • Figure 5: Relighting with a text prompt. Starting from an input RGB image, intrinsic layers are predicted using PRISM. We then apply PRISM with a text prompt describing a new lighting condition together with all predicted intrinsic layers except the irradiance map. Our relit results preserve the geometric and material properties of the original scenes while achieving a plausible appearance under the desired lighting conditions.
  • ...and 5 more figures
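
The random ablation of intrinsic channels described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the modality names, the `keep_prob` value, and the helper functions `sample_condition_mask` and `task_name` are all hypothetical, and the reading that "ablated" channels are the ones the model must generate (while the remaining ones act as conditioning) is an assumption. The source states only that channels are randomly ablated during training so that any subset of intrinsic images can serve as a condition.

```python
import random

# Hypothetical modality names; the source only refers to RGB plus "X" intrinsic layers.
MODALITIES = ["rgb", "albedo", "normal", "depth", "irradiance"]


def sample_condition_mask(rng, keep_prob=0.5):
    """Pick a random subset of modalities to act as conditioning.

    Returns a dict mapping modality name -> True if the modality is given
    as a clean condition, False if the model must generate it.
    keep_prob is an assumed hyperparameter, not a value from the paper.
    """
    mask = {name: rng.random() < keep_prob for name in MODALITIES}
    # Leave at least one modality to generate, so there is always a target.
    if all(mask.values()):
        mask[rng.choice(MODALITIES)] = False
    return mask


def task_name(mask):
    """Name the task implied by a mask, using the task names from the abstract."""
    conditioned = [name for name, given in mask.items() if given]
    if not conditioned:
        return "text-to-RGBX"   # nothing given besides the text prompt
    if conditioned == ["rgb"]:
        return "RGB-to-X"       # intrinsic decomposition
    if "rgb" not in conditioned:
        return "X-to-RGBX"      # generation conditioned on intrinsic layers
    return "mixed (RGB plus some X given)"


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        mask = sample_condition_mask(rng)
        print(task_name(mask), mask)
```

Sampling a fresh mask per training example is one simple way a single model could see all three task types from the abstract; how the actual method schedules or weights these cases is not stated in this section.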