MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models

Jonas Belouadi, Tamy Boubekeur, Adrien Kaiser

TL;DR

Procedural materials are represented as directed acyclic graphs that generate texture maps for physically-based rendering, but prior work largely relies on text-only graph representations. MultiMat introduces a multimodal program synthesis framework that conditions node-graph generation on visual feedback from intermediate graph states $G_t$ and outputs $I_t$, organized as a multimodal program tree $\mathcal{T}$, and uses a transpiler to convert graphs into Substance Designer formats. An incremental tree search with automatic error repair validates each generation step and backtracks on errors, ensuring correctness and efficiency during inference. Trained on a large, production-grade Substance Designer dataset and evaluated on both unconditional and conditional synthesis, MultiMat achieves state-of-the-art visual fidelity and generation efficiency, offering a practical path toward accessible, production-grade procedural materials for artists.

Abstract

Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structure and intermediate states enable a modular, interpretable workflow for interactive appearance modeling. However, creating such graphs remains challenging and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures static correctness while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.

Paper Structure

This paper contains 18 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Procedural materials offer powerful control over the appearance of 3D objects through a few high-level parameters. Shown here is a production-grade example (left) together with the images obtained using two distinct parameter sets A and B (right).
  • Figure 2: Architecture overview of MultiMat during inference. The system constructs a multimodal program tree $\mathcal{T}$ by iteratively generating node definitions. At each step $t$, the system derives a graph $G_t$ of valid nodes along with corresponding intermediate outputs $I_t$ by traversing $\mathcal{T}$, which may contain both valid and invalid nodes, to generate the next node $v_{t+1}$. When transpilation and execution succeed, the system advances with an updated graph $G_{t+1}$ and outputs $I_{t+1}$. If errors occur, it reverts to a previous state $(G_{\leq t}, I_{\leq t})$. The generation process starts either from an input image or unconditionally from a beginning-of-sequence token (<bos>). Following optional parameter optimization (cf. the section on conditional generation), the final procedural material can be applied to any target geometry for rendering.
  • Figure 3: Visualization of the two conditioning approaches used by MultiMat for generating node definition $v_{t+1}$. In the graph-conditioned approach (1), MultiMat processes the graph $G_t$ as a visual representation similar to human perception. In the mixed-conditioned approach (2), MultiMat receives $G_t$ as a multimodal program where <img> tokens are replaced with their corresponding vision encoder representations from $I_t$.
  • Figure 4: Visualization of our inference algorithm as a tree search. Tree nodes represent generated node definitions, and edges represent possible continuations. The algorithm proceeds as follows: generation continues until an invalid state (✗) is encountered (1), triggering backtracking to the previous node; from this point, if a valid node (✓) is generated, normal generation resumes (2a), but if invalid outputs persist (2b), the algorithm backtracks further until a valid path is found (3).
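
The backtracking behaviour described in the Figure 2 and Figure 4 captions can be illustrated with a minimal sketch. This is not the authors' implementation: the function `tree_search`, its callbacks `propose`, `is_valid`, and `is_complete`, and the retry budget `max_retries` are all hypothetical names introduced here. In the real system, `propose` would presumably be a sample from the multimodal model and `is_valid` would correspond to transpiling and executing the candidate node; here both are toy stand-ins so the example runs on its own.

```python
from typing import Callable, List, Optional


def tree_search(
    propose: Callable[[List[str], int], str],
    is_valid: Callable[[List[str], str], bool],
    is_complete: Callable[[List[str]], bool],
    max_retries: int = 2,   # assumption: fixed retry budget per depth
    max_steps: int = 100,   # assumption: hard cap on generation steps
) -> Optional[List[str]]:
    """Grow a list of valid nodes, backing up when errors persist."""
    graph: List[str] = []      # accepted nodes so far (plays the role of G_t)
    failures: List[int] = [0]  # failures[d]: failed attempts to extend depth d
    for _ in range(max_steps):
        if is_complete(graph):
            return graph
        depth = len(graph)
        candidate = propose(graph, failures[depth])
        if is_valid(graph, candidate):
            # Valid node: advance to the next state.
            graph.append(candidate)
            failures.append(0)
            continue
        # Invalid node: discard it and stay at the last valid state.
        failures[depth] += 1
        # If errors persist at this depth, drop the last accepted node too
        # and count that as a failure one level up.
        while graph and failures[len(graph)] > max_retries:
            failures.pop()
            graph.pop()
            failures[len(graph)] += 1
        if not graph and failures[0] > max_retries:
            return None  # retry budget exhausted at the root
    return None


# Toy stand-ins (hypothetical): every continuation after "a" is invalid.
def propose(graph: List[str], attempt: int) -> str:
    return ["a", "b", "c"][attempt % 3]


def is_valid(graph: List[str], node: str) -> bool:
    return not (graph and graph[-1] == "a")


def is_complete(graph: List[str]) -> bool:
    return len(graph) == 2


result = tree_search(propose, is_valid, is_complete)
print(result)  # ['b', 'a']
```

In this toy run the search first accepts "a", fails three times to extend it, backs up to the empty graph, and then succeeds via "b". The fixed retry budget is only one possible policy for deciding when to back up further; the captions do not specify which criterion MultiMat uses.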