Compositional Representation of Polymorphic Crystalline Materials

Namkyeong Lee, Heewoong Noh, Gyoung S. Na, Jimeng Sun, Tianfan Fu, Marinka Zitnik, Chanyoung Park

TL;DR

PCRL addresses the challenge of learning material representations from composition alone when crystal polymorphism causes one composition to map to multiple structures. It introduces a probabilistic composition encoder that models each composition as $p(\tilde{\mathbf{z}}^a|\mathbf{X}^a,\mathbf{A}^a) \sim \mathcal{N}(\mathbf{z}_{\mu}^a,\mathbf{z}_{\sigma}^a)$, with $\mathbf{z}_{\mu}^a=f_{\mu}^a(\mathbf{Z}^a)$ and $\mathbf{z}_{\sigma}^a=f_{\sigma}^a(\mathbf{Z}^a)$, alongside a structural graph encoder $f^b$ that guides alignment. Training uses a soft contrastive loss $\mathcal{L}_{\text{con}}$ and a KL regularizer $\mathcal{L}_{\text{KL}}$, combined as $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{con}} + \beta \cdot \mathcal{L}_{\text{KL}}$. The framework aligns each composition representation with its polymorphic structures by sampling from the learned distribution and computing a soft matching probability, so the pre-trained encoder transfers to diverse property prediction tasks without requiring explicit structure at inference time. Evaluations on 16 datasets, covering DFT-derived and experimental targets across Band Gap, Formation Enthalpy, ESTM properties, and Matbench metrics, show that incorporating structural information and the uncertainty arising from polymorphism yields better, more transferable representations, and that the learned uncertainties reflect polymorphic complexity and can guide material discovery. The work demonstrates that universal compositional representations can be learned by jointly leveraging composition and structure while explicitly modeling uncertainty, which is of practical value for data-scarce, real-world material discovery. Code is available at the authors’ GitHub, and the experiments indicate that PCRL is robust and applicable across diverse materials domains.
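The objective described above can be made concrete with a small numerical sketch. This is not the authors' implementation: it assumes that $\mathbf{z}_{\sigma}^a$ is a per-dimension standard deviation, that the KL term is taken against a standard normal prior, and that the soft matching probability is a sigmoid of the negative scaled Euclidean distance, averaged over samples. The function names and the scalars `c` and `d` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_composition(z_mu, z_sigma, num_samples, rng):
    """Draw reparameterized samples z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal((num_samples,) + z_mu.shape)
    return z_mu + z_sigma * eps

def match_probability(z_samples, z_struct, c=1.0, d=0.0):
    """Monte Carlo estimate of a soft matching probability.

    Assumed form: mean over samples of sigmoid(-c * ||z_j - z_struct|| + d).
    """
    dist = np.linalg.norm(z_samples - z_struct, axis=-1)
    return float(np.mean(1.0 / (1.0 + np.exp(c * dist - d))))

def soft_contrastive_loss(p, is_match, eps=1e-12):
    """Binary cross-entropy on the matching probability."""
    return float(-np.log(p + eps)) if is_match else float(-np.log(1.0 - p + eps))

def kl_to_standard_normal(z_mu, z_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ); the prior is an assumption."""
    var = z_sigma ** 2
    return float(0.5 * np.sum(var + z_mu ** 2 - 1.0 - np.log(var)))

def total_loss(z_mu, z_sigma, z_struct, is_match, beta, rng, num_samples=8):
    """L_total = L_con + beta * L_KL for one composition/structure pair."""
    z = sample_composition(z_mu, z_sigma, num_samples, rng)
    p = match_probability(z, z_struct)
    return soft_contrastive_loss(p, is_match) + beta * kl_to_standard_normal(z_mu, z_sigma)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_mu, z_sigma = np.zeros(4), 0.5 * np.ones(4)   # toy composition distribution
    z_struct = np.array([0.1, -0.2, 0.0, 0.3])      # toy structural embedding
    print(total_loss(z_mu, z_sigma, z_struct, is_match=True, beta=1e-2, rng=rng))
```

Under these assumptions, a composition with several polymorphs must place probability mass near more than one structural embedding, which pushes its standard deviation up; this is the intuition behind reading the learned uncertainty as a signal of polymorphic complexity.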

Abstract

Machine learning (ML) has seen promising developments in materials science, yet its efficacy largely depends on detailed crystal structural data, which are often complex and hard to obtain, limiting its applicability in real-world material synthesis processes. An alternative, using compositional descriptors, offers a simpler approach by indicating the elemental ratios of compounds without detailed structural insights. However, accurately representing materials solely with compositional descriptors presents challenges due to polymorphism, where a single composition can correspond to various structural arrangements, creating ambiguities in its representation. To this end, we introduce PCRL, a novel approach that employs probabilistic modeling of composition to capture the diverse polymorphs from available structural information. Extensive evaluations on sixteen datasets demonstrate the effectiveness of PCRL in learning compositional representations, and our analysis highlights the potential applicability of PCRL in material discovery. The source code for PCRL is available at https://github.com/Namkyeong/PCRL.

Paper Structure (46 sections, 12 equations, 10 figures, 14 tables)

Figures (10)

  • Figure 1: (a) A crystal can have multiple descriptors for ML input, such as graphical and compositional descriptors. (b) Diamond and Graphite are polymorphic crystal structures of composition C. While these crystal structures share the same composition, their properties are completely different. (c) Graphical descriptors necessitate costly recalculation of crystal structures during the material synthesis process, thereby restricting the ML capability to the same bottleneck as conventional materials discovery processes. (d) Compositional representation enables the use of ML without requiring these expensive structural calculations. (e) Pre-training the composition graph encoder with contrastive learning. While the structural graph encoder obtains a deterministic structural representation of a crystal, the probabilistic composition encoder learns to represent each composition as a parameterized probabilistic distribution by acquiring mean and diagonal covariance matrices. Both encoders are jointly trained with a soft contrastive loss in representation space. (f) The pre-trained composition mean encoder $f_{\mu}^{a}$ can be utilized to predict various properties of materials, while the composition uncertainty encoder $f_{\sigma}^{a}$ can guide scientists in determining which materials to investigate further.
  • Figure 2: Scatter plot between true and predicted $Z\bar{T}$.
  • Figure 3: High-throughput screening results.
  • Figure 4: Representation learning performance (MAE) comparison trained with different datasets.
  • Figure 5: Representation learning performance (MAE) comparison trained with different relationships.
  • ...and 5 more figures