Accelerating Materials Discovery: Learning a Universal Representation of Chemical Processes for Cross-Domain Property Prediction
Mikhail Tsitsvero, Atsuyuki Nakao, Hisaki Ikebata
TL;DR
The paper addresses learning from heterogeneous chemical-process data by proposing a universal directed-tree representation that unifies text, molecular structures, and numeric data in a single process-graph format. It then introduces a multi-modal graph neural network with property-conditioned attention that learns transferable embeddings from a large, diverse corpus and supports data-efficient fine-tuning on domain-specific tasks. On a UV-absorber formulation task with only 153 domain samples, the pretrained backbone achieves $R^2 = 0.96$, indicating strong cross-domain transfer. The work also discusses future extensions to richer modalities, probabilistic calibration, and autonomous experimental design, with the aim of accelerating materials discovery through integrated, cross-domain process reasoning.
Abstract
Experimental validation of chemical processes is slow and costly, limiting exploration in materials discovery. Machine learning can prioritize promising candidates, but the existing data in patents and the literature are heterogeneous and difficult to use. We introduce a universal directed-tree process-graph representation that unifies unstructured text, molecular structures, and numeric measurements into a single machine-readable format. To learn from this structured data, we develop a multi-modal graph neural network with a property-conditioned attention mechanism. Trained on approximately 700,000 process graphs from nearly 9,000 diverse documents, our model learns semantically rich embeddings that generalize across domains. When fine-tuned on compact, domain-specific datasets, the pretrained model achieves strong performance, demonstrating that universal process representations learned at scale transfer effectively to specialized prediction tasks with minimal additional data.
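To make the two central ideas concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical node schema (`ProcessNode` with `kind`, `text`, `smiles`, `value`, `unit`), assumes that edges point from a product toward its inputs, and stands in for property-conditioned attention with a plain softmax over dot products between a property query vector and node vectors. The names `iter_nodes` and `attention_pool`, the example recipe, and the toy 2-d vectors are all illustrative assumptions; the actual architecture may differ substantially.

```python
from dataclasses import dataclass, field
from math import exp
from typing import Iterator, List, Optional, Sequence, Tuple


@dataclass
class ProcessNode:
    """One node of a directed process tree (hypothetical schema)."""
    kind: str                      # e.g. "product", "operation", "material"
    text: Optional[str] = None     # unstructured description
    smiles: Optional[str] = None   # molecular structure, if the node has one
    value: Optional[float] = None  # numeric measurement or amount
    unit: Optional[str] = None
    children: List["ProcessNode"] = field(default_factory=list)

    def add(self, child: "ProcessNode") -> "ProcessNode":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child


def iter_nodes(root: ProcessNode) -> Iterator[ProcessNode]:
    """Depth-first traversal that yields each parent before its children."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.children))


def attention_pool(
    node_vecs: Sequence[Sequence[float]],
    property_query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool node vectors with softmax weights conditioned on a property query.

    Returns (pooled_vector, weights). This is generic dot-product attention,
    used here only as a stand-in for the paper's mechanism.
    """
    scores = [sum(q * x for q, x in zip(property_query, v)) for v in node_vecs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(property_query)
    pooled = [sum(w * v[i] for w, v in zip(weights, node_vecs)) for i in range(dim)]
    return pooled, weights


# Illustrative recipe (made-up amounts): a product made by mixing two liquids.
root = ProcessNode(kind="product", text="mixed solution")
mix = root.add(ProcessNode(kind="operation", text="mix"))
mix.add(ProcessNode(kind="material", text="ethanol", smiles="CCO", value=10.0, unit="g"))
mix.add(ProcessNode(kind="material", text="water", smiles="O", value=90.0, unit="g"))

kinds = [node.kind for node in iter_nodes(root)]

# Toy 2-d vectors standing in for learned node embeddings.
toy_vecs = [[0.0, 1.0], [2.0, 0.0]]
pooled, weights = attention_pool(toy_vecs, property_query=[1.0, 0.0])
```

In this sketch the same tree can be pooled with different query vectors, one per target property, which is the intuition behind conditioning attention on the property being predicted: nodes that matter for one property can receive little weight for another.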
