Contrastive Learning of English Language and Crystal Graphs for Multimodal Representation of Materials Knowledge

Yang Jeong Park, Mayank Kumaran, Chia-Wei Hsu, Elsa Olivetti, Ju Li

TL;DR

This work tackles data scarcity and biased sampling in crystal science by introducing CLaC, a multimodal contrastive model that jointly embeds crystal graphs and language. Trained on 126k GPT-synthesized crystal-text pairs (with a comparable literature-derived dataset for comparison), CLaC learns a shared latent space via inter-modal and intra-modal alignment, achieving state-of-the-art zero-shot retrieval and strong performance on named entity recognition (NER) and paper abstract classification (PAC). The approach pairs graph encoders (CGCNN, PaiNN) with text encoders (SciBERT, MatSciBERT) and uses synthetic data to compensate for limited crystal data, demonstrating robust cross-modal generalization and a meaningfully organized latent space. The work highlights the potential of synthetic data and multimodal supervision to advance materials discovery, while noting that the model is currently limited to crystals and would need to be extended to polycrystals and MOFs. Overall, CLaC represents a scalable, data-efficient pathway toward language-guided crystal design and retrieval.

Abstract

Artificial intelligence (AI) is increasingly used for the inverse design of materials, such as crystals and molecules. Existing AI research on molecules has integrated chemical structures of molecules with textual knowledge to adapt to complex instructions. However, this approach has been unattainable for crystals due to data scarcity from the biased distribution of investigated crystals and the lack of semantic supervision in peer-reviewed literature. In this work, we introduce a contrastive language-crystals model (CLaC) pre-trained on a newly synthesized dataset of 126k crystal structure-text pairs. To demonstrate the advantage of using synthetic data to overcome data scarcity, we constructed a comparable dataset extracted from academic papers. We evaluate CLaC's generalization ability through various zero-shot cross-modal tasks and downstream applications. In experiments, CLaC achieves state-of-the-art zero-shot generalization performance in understanding crystal structures, surpassing the latest large language models.
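
The contrastive pre-training mentioned in the abstract can be illustrated with a minimal sketch of an inter-modal alignment objective. The code below is a generic CLIP-style symmetric InfoNCE loss written in plain Python, not the authors' implementation: the function names, the temperature value of 0.07, and the use of cosine similarity are all assumptions made for illustration, and the intra-modal (augmentation) terms of CLaC are not covered.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(graph_embs, text_embs, temperature=0.07):
    """Pairwise cosine similarities divided by a temperature (assumed value)."""
    g = [l2_normalize(v) for v in graph_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(gi, tj)) / temperature for tj in t] for gi in g]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(graph_embs, text_embs, temperature=0.07):
    """Average of the graph-to-text and text-to-graph InfoNCE terms."""
    logits = cosine_logits(graph_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Hypothetical toy embeddings: row i of each list is assumed to be a matched pair.
graphs = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(graphs, texts_aligned)
loss_shuffled = symmetric_contrastive_loss(graphs, texts_shuffled)
```

In this toy setup the matched pairs yield a much smaller loss than the shuffled pairs, which is the behavior a contrastive objective rewards: paired crystal and text embeddings are pulled together while mismatched ones are pushed apart.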

Paper Structure

This paper contains 28 sections, 11 equations, 14 figures.

Figures (14)

  • Figure 1: Conceptual schematic diagram of the data pipeline and model architecture of the contrastive language-crystals multimodal pretraining. a, The data processing flow for generating GPT-synthesized narrative multimodal pair data. b, The data processing flow for generating multimodal data from natural language academic papers. c, The overall architecture of the model. The original graph, text, and corresponding augmented entities pass through graph and text encoders, which feed each projector, leading to a multimodal projector that aligns both data types.
  • Figure 2: Pipeline of downstream tasks. a, Text-crystal zero-shot retrieval and crystal-text zero-shot retrieval. b, Zero-shot classification. c, Paper abstract classification. d, Named entity recognition.
  • Figure 3: Zero-shot generalization ability. a, Text-to-crystal zero-shot retrieval accuracy within a pool of 1,024 candidates. 'CS' signifies that the graph encoder is a CGCNN and the text encoder is a SciBERT. 'P' indicates PaiNN, and 'M' indicates MatSciBERT. b, Zero-shot classification accuracy. c, A graphical representation of the material-application similarity matrix. The matrix showcases six different materials – ${\rm Li_7La_3Zr_2O_{12}}$, ${\rm GaAs}$, ${\rm BaTiO_3}$, ${\rm B_4C}$, ${\rm Zr}$, and ${\rm RuO_2}$ – and their associations with six distinct application categories: solid-state batteries, fuel cells, semiconductors, nuclear structural materials, supercapacitors, and neutron shielding. Colored dots indicate the relevance of each material to the respective application, with the color coding corresponding to the degree of suitability based on underlying material properties.
  • Figure 4: Comparison of fine-tuning performance for downstream tasks. a, Radar chart for named entity recognition models. The chart illustrates the evaluation metrics across various categories for three models: SciBERT, MatSciBERT, and CLaC with PaiNN and SciBERT encoders, denoted as CLaC-PS. Each axis represents a different metric, and the distance from the center indicates the score achieved. The overall F1 scores for each model are highlighted. b, Validation accuracy (valid. acc.) and test accuracy (test acc.) of paper abstract classification for glass vs. non-glass and Li- vs. Na-ion battery. Each experiment is averaged over three random seeds.
  • Figure 5: Visualization of attention heatmaps from various models. a, Schematic diagram of attention heatmap calculation. The four heatmaps represent attention patterns across different layers and heads of b, SciBERT, c, SciBERT as the text encoder of CLaC after joint training, d, MatSciBERT, and e, MatSciBERT as the text encoder of CLaC after joint training. Brighter colors indicate higher attention values, showing how the model focuses on different parts of the input sequence.
  • ...and 9 more figures
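
The text-to-crystal zero-shot retrieval accuracy described in the Figure 3 caption can be sketched as a top-k ranking over a candidate pool. The snippet below is a hedged illustration in plain Python, not the paper's evaluation code: the function names, the cosine-similarity scoring, and the convention that text query i is paired with crystal i are assumptions, and the toy pool has three candidates rather than 1,024.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_retrieval_accuracy(text_embs, crystal_embs, k=1):
    """Fraction of text queries whose paired crystal (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(text_embs):
        scores = [cosine(query, cand) for cand in crystal_embs]
        # Rank candidate indices from most to least similar to the query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Hypothetical toy embeddings standing in for encoder outputs.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
crystal_embs = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.0]]

top1 = top_k_retrieval_accuracy(text_embs, crystal_embs, k=1)
top2 = top_k_retrieval_accuracy(text_embs, crystal_embs, k=2)
top3 = top_k_retrieval_accuracy(text_embs, crystal_embs, k=3)
```

Because no fine-tuning on the retrieval pool is involved, the score depends only on how well the pre-trained encoders place matched text and crystal embeddings near each other, which is why accuracy can only stay the same or rise as k grows.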