VideoMatGen: PBR Materials through Joint Generative Modeling

Jon Hasselgren; Zheng Zeng; Milos Hasan; Jacob Munkberg

Abstract

We present a method for generating physically based materials for 3D shapes, built on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallicity, height map) to form physically plausible materials. We further introduce a custom variational auto-encoder that encodes multiple material modalities into a compact latent space, enabling joint generation of these modalities without increasing the number of tokens. Given a text prompt, our pipeline generates high-quality materials for 3D shapes that are compatible with common content creation tools.

Paper Structure (31 sections, 2 equations, 9 figures, 3 tables)

Introduction
Related Work
Diffusion Models.
Differentiable Rendering.
Texture and material extraction using diffusion.
Diffusion-based 3D asset generation.
Intrinsic decomposition of images/videos.
Joint generative modeling
Method
Base Video Model Architecture
Per-frame encoding
Joint generative modeling
Finetuning
Dataset
Transfer multi-view intrinsics to texture space
...and 16 more sections

Figures (9)

Figure 1: Given 3D models and text prompts, we generate unique, high-quality PBR materials for each 3D part using a finetuned video diffusion model. Our generated materials are directly applicable in content creation applications. Here we show a Physical AI training application, applying the generated materials to a virtual factory setting. On the right, we show three variations of generated materials (from the same detailed text prompts but different random seeds) for an industrial robot asset with 19 parts.
Figure 2: Our method starts from a known 3D model and a text prompt. We first render videos of normal maps and world space positions. Next, these conditions are encoded into latent space, using a pretrained encoder, $\mathcal{E}$, to produce latent conditions, $\mathbf{z}^{\mathbf{I}}$. These are concatenated with noisy latents, $\mathbf{z}_\tau^{\mathbf{mat}}$, representing material modalities, along the channel dimension. The latents and text prompt are then passed to our finetuned video model, which generates a denoised latent, $\hat{\mathbf{z}}^{\mathbf{mat}}$. The denoised latent is decoded into videos of the intrinsic material channels: base color, roughness, metallicity, and height, using a custom VAE decoder $\mathcal{D}_{\mathrm{pbr}}$ which decodes all material properties jointly. Finally, we project the generated views into texture space to extract high-quality, standard PBR materials.
Figure 3: Material generation. We compare against Hunyuan3D Paint 2.1 [he2025materialmvp] (image and text guided versions) and VideoMat [munkberg2025videomat] (text) on three example meshes from the BlenderVault [litman2025materialfusion] dataset. We encourage the reader to zoom in and compare the quality of the intrinsics (base color, roughness, metallicity), as well as to see the supplementary materials.
Figure 4: Left: Our method predicts a height (bump) map, which improves the visual richness of the generated material. Right: corresponding rendering without bump map.
Figure 5: We generate three materials from the same text prompt (see supplemental), each with a unique random seed. This results in subtle variations of materials for the two examples.
...and 4 more figures
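
The Figure 2 caption states that the latent conditions are concatenated with the noisy material latents along the channel dimension, and the abstract notes that joint generation does not increase the number of tokens. The shape-only sketch below illustrates why these two statements fit together, under assumptions that are not taken from the paper: the latent layout (channels, frames, height, width), the channel counts, the separate encoding of normals and positions, and the patch size are all hypothetical, as are the helper names `concat_channels` and `token_count`.

```python
from math import prod

# Assumed latent layout: (channels, frames, height, width).


def concat_channels(*shapes):
    """Return the shape of a channel-wise concatenation of latent tensors."""
    first = shapes[0]
    for s in shapes:
        if s[1:] != first[1:]:
            raise ValueError("latents must share frame and spatial dimensions")
    return (sum(s[0] for s in shapes),) + tuple(first[1:])


def token_count(shape, patch=(1, 2, 2)):
    """Count tokens after patchifying the frame, height and width axes.

    The patch size is an assumption made for illustration only.
    """
    _, *dims = shape
    if any(d % p for d, p in zip(dims, patch)):
        raise ValueError("each dimension must be divisible by its patch size")
    return prod(d // p for d, p in zip(dims, patch))


# Hypothetical sizes, not taken from the paper.
z_normal = (16, 8, 32, 32)    # encoded normal-map video
z_position = (16, 8, 32, 32)  # encoded world-space position video
z_mat = (16, 8, 32, 32)       # noisy joint material latent

model_input = concat_channels(z_mat, z_normal, z_position)
```

In this sketch the concatenated input has more channels per token than the material latent alone, but `token_count(model_input)` equals `token_count(z_mat)`, because only the frame and spatial axes determine how many tokens the transformer sees.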
