Multimodal Foundation Models for Material Property Prediction and Discovery

Viggo Moro; Charlotte Loh; Rumen Dangovski; Ali Ghorashi; Andrew Ma; Zhuo Chen; Samuel Kim; Peter Y. Lu; Thomas Christensen; Marin Soljačić

TL;DR

MultiMat is a multimodal foundation-model approach for materials science that aligns embeddings from four modalities (crystal structure $C$, density of states $\rho(E)$, charge density $n_{e}(\mathbf{r})$, and text $T$) in a shared latent space. It extends CLIP with two variants, AllPairsCLIP and AnchoredCLIP, to train on more than two modalities, enabling self-supervised pre-training on the Materials Project and transfer to downstream tasks. The pre-trained crystal encoder delivers state-of-the-art performance on crystal-property prediction and supports rapid material discovery via latent-space similarity, and dimensionality reduction shows that its embeddings are interpretable. These results underscore the potential of multimodal pre-training to accelerate materials discovery and to provide transferable representations that generalize beyond single-modality designs.

Abstract

Artificial intelligence is transforming computational materials science, improving the prediction of material properties, and accelerating the discovery of novel materials. Recently, publicly available material data repositories have grown rapidly. This growth encompasses not only more materials but also a greater variety and quantity of their associated properties. Existing machine learning efforts in materials science focus primarily on single-modality tasks, i.e. relationships between materials and a single physical property, thus not taking advantage of the rich and multimodal set of material properties. Here, we introduce Multimodal Learning for Materials (MultiMat), which enables self-supervised multi-modality training of foundation models for materials. We demonstrate our framework's potential using data from the Materials Project database on multiple axes: (i) MultiMat achieves state-of-the-art performance for challenging material property prediction tasks; (ii) MultiMat enables novel and accurate material discovery via latent space similarity, enabling screening for stable materials with desired properties; and (iii) MultiMat encodes interpretable emergent features that may provide novel scientific insights.
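To make the alignment objective concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss for a batch of paired embeddings, using only the Python standard library. The loss definitions are not given in this summary, so the function names (`clip_loss`, `anchored_loss`, `all_pairs_loss`), the temperature value, and the way pairwise terms are summed are assumptions for illustration and not the authors' implementation. In particular, reading AnchoredCLIP as "align every modality with the crystal anchor" and AllPairsCLIP as "align every pair of modalities" is an inference from the method names.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax_row(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(emb_a, emb_b, temperature=0.1):
    """Symmetric contrastive loss; emb_a[i] and emb_b[i] describe the same material."""
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    n = len(a)
    # logits[i][j]: scaled cosine similarity between item i of A and item j of B.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Cross-entropy with the matching index as the target, in both directions.
    loss_ab = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    logits_t = [list(col) for col in zip(*logits)]
    loss_ba = -sum(log_softmax_row(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)


def anchored_loss(batch, anchor="C", temperature=0.1):
    """Assumed form: sum pairwise losses between the anchor and every other modality."""
    return sum(
        clip_loss(batch[anchor], batch[name], temperature)
        for name in batch
        if name != anchor
    )


def all_pairs_loss(batch, temperature=0.1):
    """Assumed form: sum pairwise losses over every unordered pair of modalities."""
    names = sorted(batch)
    return sum(
        clip_loss(batch[p], batch[q], temperature)
        for i, p in enumerate(names)
        for q in names[i + 1:]
    )
```

Here `batch` is a hypothetical dictionary mapping a modality name to a list of embeddings in the same material order, for example `{"C": [...], "DOS": [...], "T": [...]}`. With $M$ modalities, the anchored form has $M-1$ pairwise terms and the all-pairs form has $M(M-1)/2$.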

Paper Structure

This paper contains 27 sections, 6 equations, and 4 figures.

Introduction
Results
Modalities and Architecture
Overview of Multimodal Pre-training Methods
Crystal Property Prediction
Material Discovery via Latent Space Similarity
Interpretability of MultiMat Features
Discussion
Methods
Encoder Architectures
Crystal structure encoder
Density of states (DOS) encoder
Charge density encoder
Text encoder
Multimodal Pre-training Methods
...and 12 more sections

Figures (4)

Figure 1: The Multimodal Learning for Materials (MultiMat) approach. a, Crystal ($C$), DOS ($\rho(E)$), charge density ($n_e(\mathbf{r})$), and text ($T$) encoders map each modality to embeddings in a shared multimodal latent space (center). MultiMat's training objective aligns the embeddings of different modalities corresponding to the same material. b, Application of MultiMat in improved prediction of materials' properties. The $C$ encoder from (a) is transferred, and a randomly initialized linear head is trained jointly with the transferred encoder to predict material properties. c, Application of MultiMat in material discovery. The DOS encoder embeds a target DOS (in blue). In the shared latent space, the closest crystal embedding (in red) from a large collection of crystal embeddings is selected. Since the embeddings of DOS and crystal are aligned during training, the crystal whose embedding is closest to the target DOS embedding is highly likely to have a DOS (in red) that closely resembles the target. Therefore, this crystal is identified as the best candidate. d, Application of MultiMat in enabling interpretability. We visualize the latent space of the crystal encoder using dimensionality reduction to reveal information about properties of materials that are implicitly encoded in the embeddings.
Figure 2: Crystal property prediction. Mean absolute error (MAE) for the prediction of various crystal properties across baseline methods and MultiMat. Methods are grouped by color according to the number of modalities, $M$, selected from the set of all modalities $\{ C, \rho(E), n_{e}(\mathbf{r}), T\}$ (with $C$ always selected). Results for the $M=2$ and $M=3$ cases show the average performance over all allowed combinations for each category (individual experiments reported in the Supplementary Information) and error bars give the standard deviation over 3 random seeds, averaged over all experiments within that category.
Figure 3: Material discovery via latent space similarity. a, Top-$k$ accuracies for cross-modality retrieval using encoders pre-trained with AnchoredCLIP, averaged over the test set. b, Normalized MAE between the target $\rho(E)$ from the test set and the $\rho(E)$ corresponding to the best crystal candidate from the training set, identified through our latent space similarity approach when the number of closest neighbors considered is varied. The best crystal candidate is selected from a set of crystals whose embeddings are the closest neighbors to the target $\rho(E)$ in the shared latent space, where the chosen crystal has a $\rho(E)$ with the smallest normalized MAE compared to the target $\rho(E)$. MAE values are normalized by the area of target $\rho(E)$ (both computed in the $(-5 \ \textrm{eV}, 5 \ \textrm{eV})$ range) and the values reported here are averaged over the whole test set. c, Two examples of the $\rho(E)$ corresponding to the best $C$ candidate found via latent space similarity overlaid with the target $\rho(E)$ of the material discovery process.
Figure 4: Interpretability of crystal embeddings. a, Crystal embeddings after dimensionality reduction by UMAP are shown, with each embedding color-coded by one of the seven crystal systems. Some clustering based on the crystal system can be observed. b, Visualization of these dimensionality-reduced embeddings after color-coding according to each material's formation energy. c, Visualization of these dimensionality-reduced embeddings after color-coding based on whether each material is a metal or not.
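The retrieval procedure described in the captions of Figures 1c and 3 can be sketched as follows: embed a target DOS, find the nearest crystal embeddings, then keep the neighbor whose own DOS has the smallest normalized MAE against the target. This is a rough illustration under stated assumptions, not the authors' code. All function names are hypothetical, cosine similarity is assumed as the closeness measure, and the normalized MAE is read as the summed absolute difference divided by the summed target DOS on a shared uniform energy grid (so the grid spacing cancels), which is only one plausible interpretation of the caption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def nearest_crystals(target_emb, crystal_embs, k):
    """Indices of the k crystal embeddings most similar to the target embedding."""
    order = sorted(
        range(len(crystal_embs)),
        key=lambda i: cosine(target_emb, crystal_embs[i]),
        reverse=True,
    )
    return order[:k]


def normalized_mae(target_dos, candidate_dos):
    """Assumed metric: summed |difference| over summed target on a uniform grid."""
    diff = sum(abs(t - c) for t, c in zip(target_dos, candidate_dos))
    return diff / sum(target_dos)


def best_candidate(target_emb, target_dos, crystal_embs, dos_library, k):
    """Among the k nearest crystals, return (index, error) with the smallest error."""
    neighbors = nearest_crystals(target_emb, crystal_embs, k)
    errors = [(normalized_mae(target_dos, dos_library[i]), i) for i in neighbors]
    error, index = min(errors)
    return index, error


def top_k_accuracy(query_embs, true_indices, crystal_embs, k):
    """Fraction of queries whose true crystal is among the k nearest neighbors."""
    hits = sum(
        1
        for emb, truth in zip(query_embs, true_indices)
        if truth in nearest_crystals(emb, crystal_embs, k)
    )
    return hits / len(query_embs)
```

In this sketch, `crystal_embs[i]` and `dos_library[i]` are assumed to describe the same crystal, and the DOS arrays are assumed to be restricted to the energy window before being passed in. Increasing `k` can only lower or keep the best error, since the candidate set grows.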
