VideoNeuMat: Neural Material Extraction from Generative Video Models

Bowen Xue, Saeed Hadadan, Zheng Zeng, Fabrice Rousselle, Zahra Montazeri, Milos Hasan

TL;DR

VideoNeuMat addresses the data bottleneck in photorealistic material authoring by learning from internet-scale video diffusion models. It finetunes a large video model to act as a virtual gonioreflectometer and then uses a Large Reconstruction Model to map short material videos to NeuMIP-based neural materials, enabling relighting on novel shapes and views. The two-stage approach yields materials with superior realism and diversity compared to limited synthetic data and prior diffusion-based methods, effectively transferring knowledge from video priors to standalone 3D assets. This provides a practical, data-efficient pathway for producing reusable neural materials for photorealistic rendering in diverse scenes and lighting conditions.

Abstract

Creating photorealistic materials for 3D rendering requires exceptional artistic skill. Generative models for materials could help, but are currently limited by the lack of high-quality training data. While recent video generative models effortlessly produce realistic material appearances, this knowledge remains entangled with geometry and lighting. We present VideoNeuMat, a two-stage pipeline that extracts reusable neural material assets from video diffusion models. First, we finetune a large video model (Wan 2.1 14B) to generate material sample videos under controlled camera and lighting trajectories, effectively creating a "virtual gonioreflectometer" that preserves the model's material realism while learning a structured measurement pattern. Second, we reconstruct compact neural materials from these videos through a Large Reconstruction Model (LRM) finetuned from a smaller Wan 1.3B video backbone. From 17 generated video frames, our LRM performs single-pass inference to predict neural material parameters that generalize to novel viewing and lighting conditions. The resulting materials exhibit realism and diversity far exceeding the limited synthetic training data, demonstrating that material knowledge can be successfully transferred from internet-scale video models into standalone, reusable neural 3D assets.

Paper Structure

This paper contains 41 sections, 3 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Our method consists of two stages. First, we finetune a large video diffusion model into a virtual gonioreflectometer that generates a structured material video under a moving light/camera based on text or image prompts. We then infer a NeuMIP-style neural material from 17 frames using a feed-forward Large Reconstruction Model (LRM), which is trained using a rendering loss to produce a material that works under new views and lights. The resulting materials enable relighting and use on novel shapes.
  • Figure 2: For a few example materials, we show the first generated frame, the LRM-reconstructed rendering of the same frame, the U-offset map, and renderings of the material on a curved surface under different environment illuminations. The offset amount (for the neural parallax mapping effect of NeuMIP) is shown along the U-axis in texture space: a red pixel means the texture look-up is offset a few pixels to the left, while blue means an offset to the right. Not only does our approach reconstruct the generated frames faithfully, it also generates meaningful offset maps, indicating a correct understanding of the material geometry.
  • Figure 3: We visualize, to scale, the camera/light trajectories used in our training dataset. (a) The gonioreflectometer trajectory, made of two phases with a total of 81 frames. (b) To train the LRM, in addition to (a), we form 81 frames by cross-joining 9 camera and 9 light locations, as depicted.
  • Figure 4: We compare direct optimization with the LRM by showing renderings of views unseen during training. Direct optimization fails to reach the correct offset for diffusion-generated materials, likely due to misalignments in the generated videos. Even for synthetic materials with no misalignments, the LRM generalizes better to unseen views, since it was trained on more camera/light settings. Full videos are included in the supplementary video.
  • Figure 5: LRM upsampler. A linear layer used as the upsampler of the DiT tokens processes each token separately from the others, causing patch artifacts, while the convolutional layers in the VAE decoder resolve this issue.
  • ...and 3 more figures
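
The cross join mentioned in the Figure 3 caption (9 camera locations paired with 9 light locations to give 81 frames) can be illustrated with a minimal sketch. This is not the authors' code: the function name `cross_join_frames`, the use of integer indices in place of actual 3D positions, and the camera-major frame ordering are all assumptions made for illustration. Only the counts (9, 9, 81) come from the caption.

```python
from itertools import product

# Counts taken from the Figure 3 caption; everything else is hypothetical.
NUM_CAMERAS = 9
NUM_LIGHTS = 9


def cross_join_frames(num_cameras=NUM_CAMERAS, num_lights=NUM_LIGHTS):
    """Return one (camera_index, light_index) pair per frame.

    Indices stand in for the real camera/light locations, which the
    caption does not specify. The camera-major ordering is an assumption.
    """
    return list(product(range(num_cameras), range(num_lights)))


frames = cross_join_frames()
print(len(frames))  # 9 * 9 = 81 frames
```

Every camera location appears with every light location exactly once, which is why the frame count is the product of the two counts rather than their sum.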