LumiX: Structured and Coherent Text-to-Intrinsic Generation

Xu Han, Biao Zhang, Xiangjun Tang, Xianzhi Li, Peter Wonka

TL;DR

LumiX generates a coherent set of intrinsic scene maps from text using a structured diffusion framework. It couples Query-Broadcast Attention, which enforces pixel-level alignment across intrinsic properties, with Tensor LoRA, which efficiently models cross-map relations; together they enable stable joint training. The approach reports 23% higher cross-map alignment and a better preference score (0.19 vs. -0.41) than the state of the art, and it also supports image-conditioned intrinsic decomposition within the same framework. Ablation studies show that both the proposed attention and the tensor-based adaptation are critical, and the model generalizes well to in-the-wild data with competitive intrinsic decomposition performance. Overall, LumiX advances unified, physically grounded text-to-intrinsic generation and sets the stage for scaling to broader intrinsic properties and larger datasets.

Abstract

We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.
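
To make the idea of sharing queries across maps concrete, here is a minimal NumPy sketch of one single-head self-attention step. It is an illustration under stated assumptions, not the paper's implementation: the abstract only says that queries are shared across all maps, so taking the queries from one reference map (`ref=0`), the function name `query_broadcast_attention`, the per-map key/value weights `w_k` and `w_v`, and all shapes are hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_broadcast_attention(tokens, w_q, w_k, w_v, ref=0):
    """Single-head attention where one set of queries is reused by every map.

    tokens: (M, N, D) latent tokens for M intrinsic maps with N tokens each.
    w_q:    (D, D) query projection, applied once.
    w_k:    (M, D, D) per-map key projections (assumption for this sketch).
    w_v:    (M, D, D) per-map value projections (assumption for this sketch).
    ref:    index of the map that supplies the queries (assumption).
    """
    m, n, d = tokens.shape
    # Queries are computed once and broadcast to all M maps.
    q = tokens[ref] @ w_q                               # (N, D)
    # Keys and values stay specific to each map.
    k = np.einsum("mnd,mde->mne", tokens, w_k)          # (M, N, D)
    v = np.einsum("mnd,mde->mne", tokens, w_v)          # (M, N, D)
    # Every map is attended with the same queries.
    scores = np.einsum("nd,mkd->mnk", q, k) / np.sqrt(d)  # (M, N, N)
    attn = softmax(scores, axis=-1)
    return np.einsum("mnk,mkd->mnd", attn, v)           # (M, N, D)


rng = np.random.default_rng(0)
M, N, D = 5, 16, 8  # e.g. RGB, albedo, irradiance, depth, normal
tokens = rng.normal(size=(M, N, D))
w_q = rng.normal(size=(D, D)) / np.sqrt(D)
w_k = rng.normal(size=(M, D, D)) / np.sqrt(D)
w_v = rng.normal(size=(M, D, D)) / np.sqrt(D)

out = query_broadcast_attention(tokens, w_q, w_k, w_v)
print(out.shape)  # (5, 16, 8)
```

In this sketch the queries come from a single source, so token position i in every map is driven by the same query vector, while the per-map keys and values keep each property's appearance distinct. That is one plausible reading of how shared queries could encourage structural consistency.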

Paper Structure

This paper contains 26 sections, 15 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: LumiX for Text-to-Intrinsic Generation. Given a text prompt, LumiX jointly generates a coherent set of intrinsic maps, including RGB color, albedo, irradiance, depth, and normal. Built on a powerful diffusion prior, it produces diverse and physically grounded intrinsic images, and can also perform image-conditioned intrinsic decomposition even though it is trained with text-only conditioning.
  • Figure 2: Overview of LumiX. Our goal is to generate a coherent set of intrinsic maps from text. Left: Training. Multiple intrinsic images are encoded into the latent space and concatenated along the batch dimension. We introduce Query-Broadcast Attention to ensure pixel alignment across properties, and Tensor LoRA to efficiently finetune the $KV$ projections for each property. Different timesteps are assigned to different properties for flexible conditioning. Right: Inference. Given a text or image input, LumiX jointly outputs all intrinsic maps in a single forward pass, supporting both text-to-intrinsic generation and intrinsic decomposition.
  • Figure 3: Visual comparison of Attention and LoRA designs. Using vanilla attention with separate LoRA leads to the weakest alignment. Replacing it with our Tensor LoRA improves consistency and quality. IntrinsiX without its first training stage becomes unstable, while substituting Tensor LoRA alleviates collapse but still struggles to distinguish different modality characteristics. Our Query-Broadcast Attention combined with Hybrid or Tensor LoRA achieves the best results, producing consistent and high-quality intrinsic maps.
  • Figure 4: Text-to-Intrinsic Generation Comparison. Both models are built upon FLUX. While IntrinsiX tends to overfit to specific indoor scenes, our method preserves FLUX's strong prior and produces consistent, high-quality intrinsic maps even for out-of-domain prompts.
  • Figure 5: Intrinsic Decomposition Comparison. Our method performs intrinsic decomposition on in-the-wild data, producing albedo maps with less embedded lighting and generating consistent, high-quality intrinsic maps across all properties.
  • ...and 7 more figures
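
The training setup described in the Figure 2 caption (latents concatenated along the batch dimension, with a separate timestep per property) can be sketched as follows. This is a hedged illustration only. The linear interpolation between clean latent and noise is an assumed rectified-flow style schedule. The helpers `make_joint_batch` and `conditioning_timesteps`, the property order, and the idea of expressing image conditioning by fixing the known map at timestep 0 are assumptions for this example and are not confirmed by the source.

```python
import numpy as np

# Property order is an assumption; the source lists these five maps.
PROPERTIES = ["rgb", "albedo", "irradiance", "depth", "normal"]


def make_joint_batch(latents, timesteps, rng):
    """Noise each property's latent with its own timestep and stack them.

    latents:   dict name -> (C, H, W) clean latent.
    timesteps: dict name -> float in [0, 1]; 0 is clean, 1 is pure noise.
    Returns (noisy, t) with shapes (M, C, H, W) and (M,).
    """
    noisy, ts = [], []
    for name in PROPERTIES:
        x0 = latents[name]
        t = float(timesteps[name])
        noise = rng.standard_normal(x0.shape)
        # Assumed schedule: linear interpolation between data and noise.
        noisy.append((1.0 - t) * x0 + t * noise)
        ts.append(t)
    # Properties are concatenated along the batch dimension.
    return np.stack(noisy, axis=0), np.array(ts)


def conditioning_timesteps(known, t):
    """Keep the known maps clean (t = 0) and noise the rest at level t."""
    return {name: (0.0 if name in known else t) for name in PROPERTIES}


rng = np.random.default_rng(0)
latents = {name: rng.standard_normal((4, 8, 8)) for name in PROPERTIES}

# Hypothetical decomposition-style setup: RGB is given, the rest start as noise.
ts = conditioning_timesteps({"rgb"}, t=1.0)
batch, t_vec = make_joint_batch(latents, ts, rng)
print(batch.shape, t_vec)
```

Under these assumptions, text-to-intrinsic generation would correspond to noising every property, while a clean RGB latent alongside noised remaining maps would correspond to intrinsic decomposition; the per-property timestep vector is what lets one model cover both cases.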