Accelerating Materials Discovery: Learning a Universal Representation of Chemical Processes for Cross-Domain Property Prediction

Mikhail Tsitsvero, Atsuyuki Nakao, Hisaki Ikebata

TL;DR

The paper tackles the challenge of learning from heterogeneous chemical-process data by proposing a universal directed-tree representation that unifies text, molecular structures, and numeric data in a single process graph. It then introduces a multi-modal graph neural network with property-conditioned attention that learns transferable embeddings from a large, diverse corpus and supports data-efficient fine-tuning on domain-specific tasks. On a UV-absorber formulation task with only 153 domain samples, the pretrained backbone achieves $R^2 = 0.96$, indicating strong cross-domain transfer. The work also discusses future extensions to richer modalities, probabilistic calibration, and autonomous experimental design, with the aim of accelerating materials discovery through integrated, cross-domain process reasoning.

Abstract

Experimental validation of chemical processes is slow and costly, limiting exploration in materials discovery. Machine learning can prioritize promising candidates, but existing data in patents and literature is heterogeneous and difficult to use. We introduce a universal directed-tree process-graph representation that unifies unstructured text, molecular structures, and numeric measurements into a single machine-readable format. To learn from this structured data, we developed a multi-modal graph neural network with a property-conditioned attention mechanism. Trained on approximately 700,000 process graphs from nearly 9,000 diverse documents, our model learns semantically rich embeddings that generalize across domains. When fine-tuned on compact, domain-specific datasets, the pretrained model achieves strong performance, demonstrating that universal process representations learned at scale transfer effectively to specialized prediction tasks with minimal additional data.

Paper Structure

This paper contains 29 sections, 15 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Experimental process for a resin composite containing alumina filler. The alumina is pretreated through heat treatment and milling. After that, the specific gravity of the alumina is measured. Subsequently, a slurry mixture of resin, curing agent, curing accelerator, solvent, and filler is coated onto a film, dried, and the thermal conductivity of the cured resin composite is measured.
  • Figure 2: Two-step chemical process encoded as our directed-tree input. In the first process (blue region), we synthesize a polyamide by combining adipic acid (SMILES O=C(O)CCCCC(=O)O, 146 g) with hexamethylene diamine (SMILES NCCCCCCN, 116 g) in a synthesis step lasting 2 h. In the second process (green region), the resulting product is mixed with alumina (Filler, 20.0 g) in a mixing step at 220.0 $^\circ$C, and the target property, glass-transition temperature Tg, is recorded as 100.0 $^\circ$C. Node types (mix for structural grouping, txt, value, SMILES) and labeled edges (material, process, condition, properties, id) define the directed tree used by our model; the output product subtree of the first process is linked into the second process via an id edge from the downstream material node.
  • Figure 3: The architecture of the multi-modal, multi-task graph neural network. The model takes the directed-tree representation of a chemical process as input. Different node types are processed by specialized encoders: a graph neural network for molecular structures (SMILES), an LLM embedding for text, and a separate network for numerical values. The resulting embeddings are then passed to a main process GNN that performs message passing on the entire graph. Cross-modal attention pools information into a compact latent representation (property-conditioned tokens), which is shared across tasks. Task-specific output heads (indexed by document+property) map the shared latent to the corresponding prediction.
  • Figure 4: Node count distribution in the directed-tree graphs of the UV-absorber fine-tuning dataset. The 153 process graphs range from compact formulations (22 nodes) to detailed multi-component processes (364 nodes). This structural diversity reflects the range of experimental descriptions captured in the patent and demonstrates the flexibility of our directed-tree representation.
  • Figure 5: Predicted versus true values under three fine-tuning regimes on the UV-absorber task: (a) GNN-fixed, (b) adaptor, and (c) full-parameter. Points are pooled across cross-validation folds. The red dashed line indicates perfect prediction, and points of the same color belong to the same cross-validation fold.
  • ...and 3 more figures
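
The directed-tree encoding described in the Figure 2 caption can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Node` class, its `add` method, and `count_nodes` are hypothetical names, and the choice of which edge label attaches which child (for example, attaching an amount to a material through a condition edge) is a guess, since the caption lists the node types and edge labels but not the exact nesting.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

# Node types and edge labels as named in the Figure 2 caption.
NODE_TYPES = {"mix", "txt", "value", "SMILES"}
EDGE_LABELS = {"material", "process", "condition", "properties", "id"}


@dataclass
class Node:
    """One node of the process tree (hypothetical class, not the authors' code)."""
    kind: str
    payload: Optional[Union[str, float]] = None  # text, SMILES string, or number
    unit: str = ""                               # only meaningful for "value" nodes
    children: List[Tuple[str, "Node"]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.kind not in NODE_TYPES:
            raise ValueError(f"unknown node type: {self.kind!r}")

    def add(self, label: str, child: "Node") -> "Node":
        """Attach child via a labeled, directed edge and return the child."""
        if label not in EDGE_LABELS:
            raise ValueError(f"unknown edge label: {label!r}")
        self.children.append((label, child))
        return child


def count_nodes(node: Node) -> int:
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(count_nodes(child) for _, child in node.children)


# First process: polyamide synthesis (the nesting below is an assumption).
step1 = Node("mix")
adipic = step1.add("material", Node("SMILES", "O=C(O)CCCCC(=O)O"))  # adipic acid
adipic.add("condition", Node("value", 146.0, "g"))
hmda = step1.add("material", Node("SMILES", "NCCCCCCN"))  # hexamethylene diamine
hmda.add("condition", Node("value", 116.0, "g"))
synthesis = step1.add("process", Node("txt", "synthesis"))
synthesis.add("condition", Node("value", 2.0, "h"))

# Second process: mixing the product with alumina filler and recording Tg.
step2 = Node("mix")
upstream = step2.add("material", Node("mix"))
upstream.add("id", step1)  # id edge links the first process's product subtree
alumina = step2.add("material", Node("txt", "Alumina (Filler)"))
alumina.add("condition", Node("value", 20.0, "g"))
mixing = step2.add("process", Node("txt", "mixing"))
mixing.add("condition", Node("value", 220.0, "°C"))
tg = step2.add("properties", Node("txt", "glass-transition temperature Tg"))
tg.add("properties", Node("value", 100.0, "°C"))

print(count_nodes(step1), count_nodes(step2))
```

Because the first process is linked in through the id edge, `count_nodes(step2)` also walks the upstream subtree. A per-graph node count of this kind is what Figure 4 summarizes, although whether the paper counts nodes in exactly this way is an assumption.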