Compositional Representation of Polymorphic Crystalline Materials
Namkyeong Lee, Heewoong Noh, Gyoung S. Na, Jimeng Sun, Tianfan Fu, Marinka Zitnik, Chanyoung Park
TL;DR
PCRL addresses the challenge of learning material representations from composition alone when crystal polymorphism causes a single composition to map to multiple structures. It introduces a probabilistic composition encoder that models each composition as a Gaussian, $p(\tilde{\mathbf{z}}^a|\mathbf{X}^a,\mathbf{A}^a) \sim \mathcal{N}(\mathbf{z}_{\mu}^a,\mathbf{z}_{\sigma}^a)$, with $\mathbf{z}_{\mu}^a=f_{\mu}^a(\mathbf{Z}^a)$ and $\mathbf{z}_{\sigma}^a=f_{\sigma}^a(\mathbf{Z}^a)$, together with a structural graph encoder $f^b$ that guides alignment through a soft contrastive loss $\mathcal{L}_{\text{con}}$ and a KL regularization term $\mathcal{L}_{\text{KL}}$. The two terms are optimized jointly as $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{con}} + \beta \cdot \mathcal{L}_{\text{KL}}$. The framework aligns composition representations with their polymorphic structures through sampling and a soft matching probability, enabling transfer to diverse property prediction tasks without requiring explicit structure. Evaluations on 16 datasets, including DFT-derived and experimental targets across Band Gap, Formation Enthalpy, ESTM properties, and Matbench metrics, show that incorporating structural information and the uncertainty arising from polymorphism yields superior, transferable representations, with the learned uncertainties reflecting polymorphic complexity and guiding material discovery. The work demonstrates that universal compositional representations can be learned by jointly leveraging composition and structure while explicitly modeling uncertainty, offering practical impact for data-scarce real-world material discovery. Code is available at the authors’ GitHub. The experiments indicate that PCRL is robust and applicable across diverse materials domains.
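To make the objective above concrete, the following is a minimal NumPy sketch of how a Gaussian composition embedding, a soft matching probability, and the combined loss $\mathcal{L}_{\text{con}} + \beta \cdot \mathcal{L}_{\text{KL}}$ could fit together. It is an illustration under stated assumptions, not the authors' implementation: $\mathbf{z}_{\sigma}^a$ is treated as a per-dimension standard deviation, the KL term is taken against a standard normal prior, the matching probability is assumed to be a sigmoid of the negative Euclidean distance with hypothetical scale `c` and shift `d`, and all function names and the default `beta` are invented for this example.

```python
import numpy as np

def sample_composition(z_mu, z_sigma, n_samples, rng):
    """Draw samples from N(z_mu, diag(z_sigma^2)) via reparameterization.
    Assumption: z_sigma is a per-dimension standard deviation."""
    eps = rng.standard_normal((n_samples, z_mu.shape[0]))
    return z_mu + z_sigma * eps

def kl_to_standard_normal(z_mu, z_sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)).
    Assumption: the prior is a standard normal."""
    var = z_sigma ** 2
    return 0.5 * np.sum(var + z_mu ** 2 - 1.0 - np.log(var))

def soft_match_prob(z_a_samples, z_b, c=1.0, d=0.0):
    """Average over samples of sigmoid(-c * ||z_a - z_b|| + d).
    Assumption: this functional form, c, and d are illustrative."""
    dist = np.linalg.norm(z_a_samples - z_b, axis=1)
    return float(np.mean(1.0 / (1.0 + np.exp(c * dist - d))))

def total_loss(z_mu, z_sigma, z_b, is_match, beta=1e-3, n_samples=8, seed=0):
    """L_total = L_con + beta * L_KL for one composition/structure pair."""
    rng = np.random.default_rng(seed)
    samples = sample_composition(z_mu, z_sigma, n_samples, rng)
    p = np.clip(soft_match_prob(samples, z_b), 1e-12, 1.0 - 1e-12)
    # Binary cross-entropy on the match probability (assumed form of L_con).
    l_con = -np.log(p) if is_match else -np.log(1.0 - p)
    l_kl = kl_to_standard_normal(z_mu, z_sigma)
    return l_con + beta * l_kl

# Usage: a composition embedding compared against a nearby and a distant
# structure embedding (toy 4-dimensional vectors, not real data).
z_mu, z_sigma = np.zeros(4), 0.1 * np.ones(4)
loss_near = total_loss(z_mu, z_sigma, np.zeros(4), is_match=True)
loss_far = total_loss(z_mu, z_sigma, 10.0 * np.ones(4), is_match=True)
```

In this sketch, a matching structure embedding that lies far from the sampled composition embeddings incurs a larger contrastive term than one that lies close, while the KL term is identical for both pairs because it depends only on the composition distribution.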
Abstract
Machine learning (ML) has seen promising developments in materials science, yet its efficacy largely depends on detailed crystal structural data, which are often complex and hard to obtain, limiting its applicability in real-world material synthesis processes. An alternative, using compositional descriptors, offers a simpler approach by indicating the elemental ratios of compounds without detailed structural insights. However, accurately representing materials solely with compositional descriptors presents challenges due to polymorphism, where a single composition can correspond to various structural arrangements, creating ambiguities in its representation. To this end, we introduce PCRL, a novel approach that employs probabilistic modeling of composition to capture the diverse polymorphs from available structural information. Extensive evaluations on sixteen datasets demonstrate the effectiveness of PCRL in learning compositional representations, and our analysis highlights the potential applicability of PCRL in material discovery. The source code for PCRL is available at https://github.com/Namkyeong/PCRL.
