HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang

TL;DR

HiMat is a memory-efficient diffusion framework for native 4K SVBRDF generation. It combines a deep compression autoencoder (DC-AE), a linear-attention diffusion transformer (DiT), and a lightweight CrossStitch module that enforces cross-map consistency. It addresses two core challenges: (1) generating multiple 4K reflectance maps within a reduced pixel budget and (2) maintaining pixel-level alignment across maps without costly global attention. The approach also enriches material diversity through prompt augmentation with large language models and system prompts, leveraging strong priors from pretrained models. Empirical results show that HiMat delivers high-fidelity, diverse 4K SVBRDFs with practical runtimes on consumer GPUs and generalizes to intrinsic decomposition tasks, establishing a scalable foundation for high-resolution material generation.
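
To make the "pixel budget" concrete, the following sketch counts latent tokens for one 4K map under two autoencoder downsampling factors. The 32× factor for the deep compression autoencoder, the 8× baseline, and the number of maps are assumptions chosen for illustration; this summary does not state them.

```python
def latent_tokens(resolution: int, downsample: int, patch: int = 1) -> int:
    """Number of latent tokens for a square image.

    resolution: image side length in pixels.
    downsample: spatial downsampling factor of the autoencoder (assumed value).
    patch:      patch size used when turning the latent grid into tokens.
    """
    side = resolution // (downsample * patch)
    return side * side


RES = 4096      # 4K map side length
NUM_MAPS = 5    # hypothetical number of reflectance maps, not from the source

tokens_f8 = latent_tokens(RES, 8)    # assumed conventional 8x autoencoder
tokens_f32 = latent_tokens(RES, 32)  # assumed 32x deep-compression autoencoder

print("tokens per map at 8x :", tokens_f8)
print("tokens per map at 32x:", tokens_f32)
print("tokens for all maps at 32x:", NUM_MAPS * tokens_f32)
# Pairwise (quadratic) attention cost scales with tokens squared.
print("quadratic attention pairs per map at 32x:", tokens_f32 ** 2)
```

Under these assumptions, the higher compression factor cuts the per-map token count by a factor of 16, and the cost of any quadratic attention by a factor of 256.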

Abstract

Creating ultra-high-resolution spatially varying bidirectional reflectance distribution functions (SVBRDFs) is critical for photorealistic 3D content creation, because they must faithfully represent the fine-scale surface details required for close-up rendering. However, 4K generation faces two key challenges: (1) multiple reflectance maps must be synthesized at full resolution, which multiplies the pixel budget and imposes prohibitive memory and computational cost, and (2) strong pixel-level alignment must be maintained across maps at 4K, which is particularly difficult when adapting pretrained models designed for the RGB image domain. We introduce HiMat, a diffusion-based framework tailored for efficient and diverse 4K SVBRDF generation. To address the first challenge, HiMat performs generation in a high-compression latent space via DC-AE and employs a pretrained diffusion transformer with linear attention to improve per-map efficiency. To address the second challenge, we propose CrossStitch, a lightweight convolutional module that enforces cross-map consistency without incurring the cost of global attention. Our experiments show that HiMat achieves high-fidelity 4K SVBRDF generation with superior efficiency, structural consistency, and diversity compared to prior methods. Beyond materials, our framework also generalizes to related applications such as intrinsic decomposition.
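
The efficiency argument for linear attention can be illustrated with a minimal, dependency-free sketch. It uses a generic kernelized attention with a ReLU feature map, which is an assumption: this summary does not say which linear-attention variant HiMat's DiT uses. The point is that the key-value summary `S` and the normalizer `z` are built once, so cost grows linearly with the number of tokens, while the result matches the explicit pairwise form.

```python
def relu(vec):
    """Elementwise ReLU, used here as the (assumed) kernel feature map."""
    return [max(0.0, v) for v in vec]


def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention in O(N * d * d_v): summarize keys/values once."""
    d, dv = len(Q[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # sum_j phi(k_j)^T v_j
    z = [0.0] * d                       # sum_j phi(k_j)
    for k, v in zip(K, V):
        pk = relu(k)
        for a in range(d):
            z[a] += pk[a]
            for b in range(dv):
                S[a][b] += pk[a] * v[b]
    out = []
    for q in Q:
        pq = relu(q)
        denom = sum(pq[a] * z[a] for a in range(d)) + eps
        out.append([sum(pq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out


def pairwise_attention(Q, K, V, eps=1e-6):
    """Same kernel attention computed over all N * N query-key pairs."""
    out = []
    for q in Q:
        pq = relu(q)
        weights = [sum(a * b for a, b in zip(pq, relu(k))) for k in K]
        denom = sum(weights) + eps
        out.append([sum(w * v[b] for w, v in zip(weights, V)) / denom
                    for b in range(len(V[0]))])
    return out


# Tiny illustrative inputs (2 queries, 3 keys/values, feature size 2).
Q = [[1.0, -2.0], [0.5, 3.0]]
K = [[2.0, 1.0], [-1.0, 4.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(linear_attention(Q, K, V))
print(pairwise_attention(Q, K, V))
```

Both functions compute the same quantity; only the order of the sums differs. Softmax attention does not factor this way, which is why it needs the full pairwise matrix.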

Paper Structure

This paper contains 19 sections, 7 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Overview. Left: Given text instructions, our framework generates 4K SVBRDF maps through a latent denoising pipeline based on linear DiT (Sec. \ref{sssec:dit}), with outputs reconstructed by a deep compression autoencoder (DC-AE) (Sec. \ref{sssec:dcae}). CrossStitch layers (Sec. \ref{ssec:CrossStitch}) are integrated into the linear DiT block after each linear attention layer. The combination of linear DiT and DC-AE enables efficient ultra-high-resolution generation, while the CrossStitch design ensures consistency across maps. Right: Architecture of our modified DiT block (cross-attention omitted for clarity). A lightweight convolutional CrossStitch module enables localized feature exchange across maps, ensuring pixel alignment.
  • Figure 2: Normal map reconstruction quality with DC-AE. Color bias in the reconstructed normals indicates distribution mismatch with ground truth, motivating our fine-tuning of the decoder for SVBRDF maps.
  • Figure 3: Textual description comparison. MatSynth [vecchio2023matsynth] provides short keyword labels, whereas our method generates rich, perceptually faithful descriptions that better capture material appearance and structure.
  • Figure 4: Visual comparison between HiMat, ReflectanceFusion [xue2024reflectancefusion], and MatFuse [matfusion]. ReflectanceFusion exhibits baked-in lighting artifacts and is limited to $256\times256$ resolution. MatFuse suffers from reduced realism and diversity because it is trained exclusively on synthetic data at $512\times512$ resolution. In contrast, HiMat delivers diverse, high-fidelity 4K materials with fine details.
  • Figure 5: Further visual comparison between HiMat, ReflectanceFusion [xue2024reflectancefusion], and MatFuse [vecchio2024matfuse].
  • ...and 9 more figures
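
The Figure 1 caption describes CrossStitch as a lightweight convolutional module that exchanges features locally across maps. The toy sketch below shows one way such an exchange could look: each map's feature grid receives a residual update from a 3×3 neighborhood of every map. The kernel size, zero padding, residual connection, and single scalar feature per pixel are all assumptions for illustration; this is not the paper's implementation, and the names `cross_map_conv` and `zero_weights` are hypothetical.

```python
def zero_weights(num_maps):
    """Weights indexed [out_map][in_map][ky][kx]; all zeros means no exchange."""
    return [[[[0.0] * 3 for _ in range(3)] for _ in range(num_maps)]
            for _ in range(num_maps)]


def cross_map_conv(feats, weights):
    """Residual 3x3 convolution that mixes features across maps.

    feats:   [num_maps][H][W] scalar features (toy stand-in for latent tokens).
    weights: [num_maps][num_maps][3][3] mixing kernel (hypothetical layout).
    Returns feats plus the locally mixed contribution from every map.
    """
    M, H, W = len(feats), len(feats[0]), len(feats[0][0])
    # Start from a copy of the input so the update is residual.
    out = [[[feats[m][y][x] for x in range(W)] for y in range(H)]
           for m in range(M)]
    for m in range(M):                  # output map
        for y in range(H):
            for x in range(W):
                acc = 0.0
                for n in range(M):      # input map
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            yy, xx = y + dy, x + dx
                            if 0 <= yy < H and 0 <= xx < W:  # zero padding
                                acc += (weights[m][n][dy + 1][dx + 1]
                                        * feats[n][yy][xx])
                out[m][y][x] += acc
    return out


# Two toy maps on a 2x2 grid.
feats = [[[1.0, 2.0], [3.0, 4.0]],
         [[10.0, 20.0], [30.0, 40.0]]]

w = zero_weights(2)
w[0][1][1][1] = 1.0  # map 0 reads the co-located feature of map 1
mixed = cross_map_conv(feats, w)
print(mixed[0])  # map 0 after receiving map 1's features
print(mixed[1])  # map 1 is unchanged because its kernels are zero
```

Because the kernel only covers a small neighborhood, the cost of this exchange grows linearly with the number of spatial positions, in contrast to attention across all tokens of all maps.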