MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis

Di Luo, Shuhui Yang, Mingxin Yang, Jiawei Lu, Yixuan Tang, Xintong Han, Zhuo Chen, Beibei Wang, Chunchao Guo

TL;DR

MatPedia addresses the lack of a unified representation for RGB appearance and PBR properties by introducing a joint RGB-PBR latent space learned from a five-frame input. A video diffusion transformer, initialized from large-scale RGB-video priors and fine-tuned with LoRA, enables text-to-material, image-to-material, and intrinsic decomposition at high resolution. The hybrid MatHybrid-410K dataset combines RGB appearance data with PBR materials to leverage abundant RGB data for improving PBR synthesis. The results show improved quality and diversity over task-specific baselines, demonstrating a scalable foundation for realistic material generation in 3D assets.

Abstract

Physically-based rendering (PBR) materials are fundamental to photorealistic graphics, yet their creation remains labor-intensive and requires specialized expertise. While generative models have advanced material synthesis, existing methods lack a unified representation bridging natural image appearance and PBR properties, leading to fragmented task-specific pipelines and an inability to leverage large-scale RGB image data. We present MatPedia, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By formulating them as a 5-frame sequence and employing video diffusion architectures, MatPedia naturally captures their correlations while transferring visual priors from RGB generation models. This joint representation enables a unified framework handling multiple material tasks--text-to-material generation, image-to-material generation, and intrinsic decomposition--within a single architecture. Trained on MatHybrid-410K, a mixed corpus combining PBR datasets with large-scale RGB images, MatPedia achieves native $1024\times1024$ synthesis that substantially surpasses existing approaches in both quality and diversity.
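
To make the "5-frame sequence" idea concrete, the following is a minimal sketch of how one shaded RGB image and four PBR maps could be stacked into a video-like tensor. It is an illustration under stated assumptions, not the authors' implementation: the frame order, the replication of single-channel maps to three channels, and the names `FRAME_NAMES`, `pack_material`, and `unpack_material` are all hypothetical. The four map names follow the Figure 3 caption (Basecolor, Normal, Roughness, Metallic).

```python
import numpy as np

# Assumed frame order: the source only states that the RGB frame is
# combined with four PBR maps into a 5-frame sequence.
FRAME_NAMES = ("rgb", "basecolor", "normal", "roughness", "metallic")


def to_three_channels(img):
    """Return an (H, W, 3) float32 array.

    Single-channel maps (e.g. roughness) are repeated across channels.
    This replication is an assumption made for the sketch.
    """
    img = np.asarray(img, dtype=np.float32)
    if img.ndim == 2:
        img = img[..., None]
    if img.shape[-1] == 1:
        img = np.repeat(img, 3, axis=-1)
    if img.ndim != 3 or img.shape[-1] != 3:
        raise ValueError(f"expected (H, W), (H, W, 1) or (H, W, 3), got {img.shape}")
    return img


def pack_material(maps):
    """Stack the RGB frame and four PBR maps into a (5, H, W, 3) sequence."""
    frames = [to_three_channels(maps[name]) for name in FRAME_NAMES]
    if len({frame.shape for frame in frames}) != 1:
        raise ValueError("all maps must share the same spatial resolution")
    return np.stack(frames, axis=0)


def unpack_material(sequence):
    """Split a (5, H, W, 3) sequence back into a dict of named maps."""
    return {name: sequence[i] for i, name in enumerate(FRAME_NAMES)}
```

A sequence in this layout has the same shape as a short video clip, which is what would let a video autoencoder and a video diffusion transformer process it without architectural changes.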

Paper Structure

This paper contains 20 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Pipeline of the proposed MatPedia framework. Left: The 3D VAE encodes a shaded RGB frame together with optional PBR maps into a joint RGB-PBR latent representation, where PBR maps are conditioned on the RGB appearance. This compact representation supports both (a) shaded RGB decoding and (b) PBR decoding at native $1024\times1024$ resolution. Right: The DiT, initialized from large-scale video generation models and adapted via LoRA, operates on the joint latents to perform three tasks: Text-to-PBR (generate RGB/PBR from material captions), Image-to-PBR (generate planar RGB/PBR from distorted input images), and Material Decomposition (recover PBR maps from natural images). DiT blocks integrate self-attention (SA), cross-attention (CA), and LoRA modules to enable flexible conditioning across modalities.
  • Figure 2: Examples of planar material images from the RGB Appearance Dataset, generated by the Gemini 2.5 Flash Image model [gemini2025flash].
  • Figure 3: Qualitative comparison of text-conditioned PBR material generation among our method, MatFuse [vecchio2024matfuse], ControlMat [vecchio2024controlmat], and MaterialPicker [ma2024materialpicker]. For each prompt, we show the generated PBR maps (Basecolor, Normal, Roughness, Metallic) followed by a render view under point-light illumination. We note that MatFuse generates a specular map rather than a metallic map.
  • Figure 4: Qualitative comparison of image-conditioned PBR generation. For each sample, the first column shows the distorted input image (cropped from the scene), and the remaining columns present the generated material maps together with a rendering under point-light illumination. Our method produces geometrically flattened and artifact-free maps, while MatFuse shows reduced roughness fidelity and Material Palette retains geometric distortions from the input.
  • Figure 5: Qualitative comparison of material decomposition. For each sample, the first column shows the planar input image, and the remaining columns present the generated material maps together with a rendering under environment lighting. Our method produces consistent structural patterns, yielding rendered views that closely match the input appearance.
  • ...and 2 more figures
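
The LoRA adaptation mentioned in the Figure 1 caption can be illustrated with a generic low-rank adapter around a frozen linear layer. This is a sketch of the standard LoRA formulation, $y = Wx + \frac{\alpha}{r} BAx$, and not the paper's configuration: the class name `LoRALinear`, the rank, the scaling factor, the initialization scale, and the choice of which DiT layers receive adapters are all assumptions.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (generic LoRA).

    Hypothetical helper for illustration; rank and alpha are placeholders,
    not values taken from the paper.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                       # pretrained weight, kept frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))  # trainable down-projection
        self.B = np.zeros((out_dim, rank))         # trainable up-projection, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (..., in_dim); the adapter adds a rank-limited correction.
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, so fine-tuning begins from the behaviour of the original video model and only the small matrices `A` and `B` need to be trained.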