IntrinsiX: High-Quality PBR Generation using Image Priors

Peter Kocsis, Lukas Höllein, Matthias Nießner

TL;DR

IntrinsiX directly generates physically-based rendering maps (albedo, roughness, metallic, normals) from text, addressing the limitation of baked lighting in typical text-to-image outputs. It decomposes the problem into per-property priors learned via LoRA adapters and then aligns these priors with a cross-intrinsic attention mechanism, guided by a rendering loss that grounds the outputs in image-space signals. The two-stage training, combined with importance-based light sampling, yields semantically coherent, high-quality PBR maps that generalize to out-of-distribution prompts and support downstream tasks like relighting, editing, and room-scale PBR texturing. Experimental results show clear improvements over intrinsic decomposition baselines and demonstrate practical applicability in graphics pipelines, including 3D scene texturing. Overall, the work expands the role of text-conditioned diffusion models from RGB image synthesis to direct PBR map generation, enabling more flexible content creation for gaming and VR.

Abstract

We introduce IntrinsiX, a novel method that generates high-quality intrinsic images from text descriptions. In contrast to existing text-to-image models whose outputs contain baked-in scene lighting, our approach predicts physically-based rendering (PBR) maps. This enables the generated outputs to be used for content creation scenarios in core graphics applications that facilitate re-lighting, editing, and texture generation tasks. In order to train our generator, we exploit strong image priors and pre-train separate models for each PBR material component (albedo, roughness, metallic, normals). We then align these models with a new cross-intrinsic attention formulation that concatenates key and value features in a consistent fashion. This allows us to exchange information between the output modalities and to obtain semantically coherent PBR predictions. To ground each intrinsic component, we propose a rendering loss which provides image-space signals to constrain the model, thus facilitating sharp details in the output BRDF properties as well. Our results demonstrate detailed intrinsic generation with strong generalization capabilities that outperforms existing intrinsic image decomposition methods applied to generated images by a significant margin. Finally, we show a series of applications, including re-lighting, editing, and text-conditioned room-scale PBR texture generation.
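The abstract describes cross-intrinsic attention only as concatenating key and value features so that the output modalities can exchange information. The sketch below is one plausible reading of that sentence, not the authors' implementation. The function name `cross_intrinsic_attention`, the `(M, T, D)` tensor layout, the single attention head, and the choice to concatenate along the token axis are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_intrinsic_attention(q, k, v):
    """Hypothetical single-head sketch.

    q, k, v: arrays of shape (M, T, D) for M modalities
    (e.g. albedo, roughness, metallic, normals), T tokens, D channels.
    """
    m, t, d = q.shape
    # Assumption: keys and values of all modalities are concatenated
    # along the token axis into one shared set.
    k_all = k.reshape(m * t, d)
    v_all = v.reshape(m * t, d)
    # Each modality's queries attend over the shared key/value set,
    # so every output token can read from every other modality.
    scores = q @ k_all.T / np.sqrt(d)      # (M, T, M*T)
    weights = softmax(scores, axis=-1)
    return weights @ v_all                 # (M, T, D)
```

With this layout, changing the values of one modality changes the attention output of every other modality, which is the information exchange the abstract refers to.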

Paper Structure

This paper contains 43 sections, 4 equations, 20 figures, 3 tables.

Figures (20)

  • Figure 1: Method Overview. We generate the intrinsic properties of an image given text as input. Left: we train 3 different LoRAs for a pretrained, latent text-to-image model, corresponding to the intrinsic properties (albedo, normal, and roughness + metallic) on curated synthetic datasets. We facilitate communication between all 4 modalities through cross-intrinsic attention to predict PBR maps corresponding to the same image. A novel rendering loss using importance-based light sampling ensures that we can render high-quality RGB images from physically realistic PBR maps. Right: after training, we jointly denoise and decode all 4 PBR maps and can prompt our model with diverse, out-of-distribution descriptions.
  • Figure 2: Importance-based Light Sampling. We render RGB images (bottom) from our generated PBR maps and a sampled light source as input (top). We employ multinomial importance sampling based on the inverse roughness to select less rough pixels more often (red squares). The light direction is then the viewing direction to the pixel, reflected about its normal. The rendered images thus contain more specular effects, which provides better gradients during training.
  • Figure 3: Editable Image Generation. Our generated PBR maps can be edited and utilized in standard physically-based rendering frameworks to produce RGB renderings. Here, we place a light source on top of the scene at constant elevation and rotate it around the vertical axis. From top to bottom we show, (1): RGB renderings with different light source positions; (2): manual edit of the albedo (desaturate the moon color); (4): lower roughness and higher metallic value (more glossy, mirror-like reflections).
  • Figure 4: Scene Texturing. Our method can be applied to scene texturing via score distillation (SceneTex). Given a scene geometry, we first condition our method on the rendered normal maps to produce the remaining PBR maps. Through iterative optimization, we obtain realistic PBR textures for the whole scene. Then, we similarly optimize for normal map textures to obtain fine geometric details, conditioned on rendered material maps. This showcases the potential of direct PBR map generation to democratize scene texturing from only text as input.
  • Figure 5: Rendering comparisons. We show sample PBR maps of our method and baselines as well as rendered RGB images under two different lighting conditions. We use a diverse set of text prompts to produce our PBR maps, as well as the input RGB images for the baseline methods. This highlights our model's capability to retain the generalized prior of the pretrained text-to-image model. Our method better captures the semantic meaning of the individual intrinsic properties. For example, there are no baked-in lighting effects in the albedo, and the metallic/roughness maps are sharper with more intricate details. This leads to more realistic renderings and lighting effects.
  • ...and 15 more figures
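
The importance-based light sampling described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code. The function name `sample_light_direction`, the weighting `1 / (roughness + eps)` as the "inverse roughness" (the caption could also mean `1 - roughness`), and the convention that view directions point from the camera toward the pixel are all hypothetical choices.

```python
import numpy as np

def sample_light_direction(roughness, normals, view_dirs, rng, eps=1e-4):
    """Pick one pixel and derive a light direction from it.

    roughness: (H, W) array in [0, 1].
    normals, view_dirs: (H, W, 3) arrays of unit vectors; view_dirs are
    assumed to point from the camera toward each pixel.
    """
    h, w = roughness.shape
    # Assumed weighting: inverse roughness, so glossier (less rough)
    # pixels are selected more often.
    weights = 1.0 / (roughness.reshape(-1) + eps)
    probs = weights / weights.sum()
    idx = rng.choice(h * w, p=probs)       # one multinomial draw
    y, x = divmod(idx, w)
    n = normals[y, x]
    d = view_dirs[y, x]
    # Mirror reflection of the viewing direction about the normal.
    light_dir = d - 2.0 * np.dot(d, n) * n
    return (y, x), light_dir / np.linalg.norm(light_dir)
```

A light placed along the returned direction produces a specular highlight at the selected pixel, which is consistent with the caption's remark that the rendered images then contain more specular effects.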