IntrinsiX: High-Quality PBR Generation using Image Priors
Peter Kocsis, Lukas Höllein, Matthias Nießner
TL;DR
IntrinsiX directly generates physically-based rendering maps (albedo, roughness, metallic, normals) from text, addressing the limitation of baked lighting in typical text-to-image outputs. It decomposes the problem into per-property priors learned via LoRA adapters and then aligns these priors with a cross-intrinsic attention mechanism, guided by a rendering loss that grounds the outputs in image-space signals. The two-stage training, combined with importance-based light sampling, yields semantically coherent, high-quality PBR maps that generalize to out-of-distribution prompts and support downstream tasks like relighting, editing, and room-scale PBR texturing. Experimental results show clear improvements over intrinsic decomposition baselines and demonstrate practical applicability in graphics pipelines, including 3D scene texturing. Overall, the work expands the role of text-conditioned diffusion models from RGB image synthesis to direct PBR map generation, enabling more flexible content creation for gaming and VR.
Abstract
We introduce IntrinsiX, a novel method that generates high-quality intrinsic images from a text description. In contrast to existing text-to-image models, whose outputs contain baked-in scene lighting, our approach predicts physically-based rendering (PBR) maps. This enables the generated outputs to be used for content creation scenarios in core graphics applications that facilitate re-lighting, editing, and texture generation tasks. To train our generator, we exploit strong image priors and pre-train separate models for each PBR material component (albedo, roughness, metallic, normals). We then align these models with a new cross-intrinsic attention formulation that concatenates key and value features in a consistent fashion. This allows us to exchange information between the output modalities and to obtain semantically coherent PBR predictions. To ground each intrinsic component, we propose a rendering loss that provides image-space signals to constrain the model, which also facilitates sharp details in the output BRDF properties. Our results demonstrate detailed intrinsic generation with strong generalization capabilities, outperforming existing intrinsic image decomposition methods applied to generated images by a significant margin. Finally, we show a series of applications, including re-lighting, editing, and text-conditioned room-scale PBR texture generation.
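
The cross-intrinsic attention idea described above (concatenating key and value features across the PBR modalities so that each modality can attend to the others) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `cross_intrinsic_attention`, the tensor layout `(M, N, d)`, the single attention head, and the fixed modality order are all assumptions made for illustration. The abstract does not specify how the method is integrated into the diffusion model, so that part is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_intrinsic_attention(q, k, v):
    """Hypothetical single-head sketch of attention over concatenated keys/values.

    q, k, v: arrays of shape (M, N, d) with M modalities (e.g. albedo,
    roughness, metallic, normals), N tokens per modality, feature size d.
    Returns an array of shape (M, N, d).
    """
    m, n, d = q.shape
    # Concatenate the keys and values of all modalities along the token
    # axis, always in the same modality order (assumed here).
    k_all = k.reshape(m * n, d)
    v_all = v.reshape(m * n, d)
    # Every modality's queries attend to the tokens of all modalities.
    scores = q @ k_all.T / np.sqrt(d)      # (M, N, M*N)
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ v_all                 # (M, N, d)

# Usage with random features for four modalities.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16, 8))
k = rng.normal(size=(4, 16, 8))
v = rng.normal(size=(4, 16, 8))
out = cross_intrinsic_attention(q, k, v)
```

With a single modality (`M = 1`) the sketch reduces to ordinary self-attention, which is one way to see that the concatenation only widens the set of tokens each query can draw information from.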
