Establishing Deep InfoMax as an effective self-supervised learning methodology in materials informatics

Michael Moran, Vladimir V. Gusev, Michael W. Gaultois, Dmytro Antypov, Matthew J. Rosseinsky

TL;DR

This work addresses the bottleneck of scarce property labels in materials informatics by applying Deep InfoMax as a self-supervised pretraining framework to crystal transformers (Site-Net). By maximising mutual information between local and global crystal representations and their constituents, the authors enable large-scale pretraining on unlabeled CIF-derived data, improving downstream property prediction on small datasets for formation energy and band gap. The study demonstrates that representation learning and transfer learning with DIM yield robust gains in low-data regimes, with domain-aware false sampling further enhancing performance for formation energy. The findings suggest DIM can serve as a foundation-model precursor for materials informatics, compatible with multiple crystal architectures and loss-function integrations, and highlight the importance of baselines, false-sampling strategies, and in-distribution pretraining to isolate self-supervised benefits.

Abstract

The scarcity of property labels remains a key challenge in materials informatics, whereas materials data without property labels are abundant in comparison. By pretraining supervised property prediction models on self-supervised tasks that depend only on the "intrinsic information" available in any Crystallographic Information File (CIF), there is potential to leverage the large amount of crystal data without property labels to improve property prediction results on small datasets. We apply Deep InfoMax as a self-supervised machine learning framework for materials informatics that explicitly maximises the mutual information between a point set (or graph) representation of a crystal and a vector representation suitable for downstream learning. This allows the pretraining of supervised models on large materials datasets without the need for property labels and without requiring the model to reconstruct the crystal from a representation vector. We investigate the benefits of Deep InfoMax pretraining implemented on the Site-Net architecture to improve the performance of downstream property prediction models with small amounts (<10^3) of data, a situation relevant to experimentally measured materials property databases. Using a property label masking methodology, where we perform self-supervised learning on larger supervised datasets and then train supervised models on a small subset of the labels, we isolate Deep InfoMax pretraining from the effects of distributional shift. We demonstrate performance improvements in the contexts of representation learning and transfer learning on the tasks of band gap and formation energy prediction. Having established the effectiveness of Deep InfoMax pretraining in a controlled environment, our findings provide a foundation for extending the approach to address practical challenges in materials informatics.

Paper Structure (24 sections, 6 equations, 12 figures, 1 table)

This paper contains 24 sections, 6 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: A comparison of how a Deep InfoMax model and an autoencoder evaluate the quality of a learned vector representation ($z$). In each case, the considered representation is that of an oxygen local environment in the conventional unit cell of lithium oxide; the oxygen atom has been centred in the unit cell and highlighted. (a) In an autoencoder, the model attempts to reconstruct the local environment from the latent representation; the model is then evaluated on the basis of the accuracy of the reconstruction. (b) Deep InfoMax avoids the requirement to reconstruct the local environment by answering an array of questions. The array of questions is generated by pairing the representation with both local samples used to create it and local samples from a different crystal; the questions ask which pairs belong to each other and which do not. In this case the samples are the individual neighbours within the local environment and their distance from the central atom, which are the constituents of the local environment set prior to pooling into a single vector. The questions are answered by a binary classifier, which is evaluated on its ability to identify which neighbours belong to a local environment and which do not. The ability of the model to correctly identify all so-called "true samples" and "false samples" for a representation verifies the information content of the representation in the same way that a successful reconstruction does in (a). More formally, the ability of the model to identify all constituents of the set given the representation acts as a lower bound on mutual information between the representation and the original input.
  • Figure 2: The Site-Net architecture represents the crystal as a (a, $B_{i,j,f}$) matrix of every bond in a large cubic supercell consisting of the elemental features of the sites, their neighbours, and the chemical interaction between them. For demonstration purposes the features are shown for a primitive unit cell of Li$_2$O, as in Figure 1. The colours in the bond features represent the information from the lithium features, the oxygen features, and the interaction between them, which become increasingly mixed as the representation is reduced to a single vector. $i$ and $j$ index the atomic site being considered, and both axes have length equal to the number of sites in the supercell. $f$ is the featurisation axis and varies in length throughout the model according to the size of the neural network layers and the starting feature vector length. Each row ($i$) and column ($j$) in $B_{i,j,f}$ can be understood to be a local environment centred on site $i$. Site-Net uses self-attention to pool pairwise interactions centred on a particular site into (b, $S^\prime_{i,f}$) local environment features, and the resulting set of local environment features is then pooled into a (c, $G_f$) single global feature vector that describes the entire crystal. The colours demonstrate the increasing levels of information mixing from raw pairs of elemental features with a distance, to local environments, to a single feature vector summarising the crystal. The aggregation of pairwise interactions into local environment descriptors and the aggregation of local environment descriptors into a global feature vector are distinct aggregation steps, and information can be lost at each. Therefore, Deep InfoMax is applied independently to both parts of the process. Mutual information is maximised between the vector representing the local environment and its constituent pairwise interactions, and mutual information is maximised between the global crystal representation and its constituent local environments.
  • Figure 3: The computation of the Jensen-Shannon entropy loss function is visualised. The representation ($z$), true sample ($c$) and false sample ($c^\prime$) are first upscaled to a shared higher-dimensional space. Once in this higher-dimensional space, the dot product is taken between the learned representation and each of the two samples. The Jensen-Shannon entropy loss function is then used to maximise the separation between the two dot products. The greater this separation across the dataset, the greater the lower bound on mutual information between the representation and the constituents from which it was created.
  • Figure 4: The learning of a global representation of the crystal takes place in two distinct steps. The first step is to generate summaries of the local environments of the sites in the crystal using a Site-Net transformer block and to maximise the mutual information between local environment features and the pairwise interactions they are constructed from. The second step is to process the local environment features through some shared neural network layers before taking the mean to construct a global feature vector. Mutual information is maximised between the learned local environment features and the global feature vector. These are effectively two separate models, and the gradients are isolated from each other: the global Deep InfoMax objective cannot adjust the features of the local environments. The backpropagation is shown explicitly using coloured arrows. The gradient flow for the local environment feature learning is shown in red, and the gradient flow for the global feature learning is shown in blue. Dashed lines represent the forward propagation of features without corresponding backpropagation. The two essentially independent submodels are trained in parallel.
  • Figure 5: The false sampling strategy is shown in the context of (a) Li$_2$O. The false samples consist of (b) an unrelated real crystal, (c) an artificial crystal with the same stoichiometry as Li$_2$O with geometry donated from an unrelated real crystal, (d) an artificial crystal with the same structure as Li$_2$O but with stoichiometry donated from an unrelated real crystal, and (e) the same stoichiometry and structure as Li$_2$O but with the positions of each atomic species randomised. The final false sample with shared structure and stoichiometry is only deployed when learning global representations from the local environment representations, as the risk of "sample collision", that is, false samples being the same as true samples by chance, is too high when considering only two elements and a distance.
  • ...and 7 more figures
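
The two aggregation steps described in the Figure 2 caption can be illustrated with a small shape-level sketch. This is not the Site-Net implementation: the function names `pool_local` and `pool_global` are hypothetical, and a plain mean over neighbours stands in for the self-attention pooling that Site-Net uses at the local step. Only the shapes follow the caption: $B_{i,j,f}$ is reduced to $S^\prime_{i,f}$, which is reduced to $G_f$.

```python
from statistics import fmean


def pool_local(bond_features):
    """B[i][j][f] -> S[i][f]: average over neighbours j for each site i.

    Assumption: a plain mean replaces Site-Net's attention pooling.
    """
    return [
        [fmean(column) for column in zip(*neighbours)]
        for neighbours in bond_features
    ]


def pool_global(site_features):
    """S[i][f] -> G[f]: average over all sites i."""
    return [fmean(column) for column in zip(*site_features)]


# Toy input: 2 sites, 2 neighbours per site, 3 features per bond.
B = [
    [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]],
    [[0.0, 4.0, 2.0], [2.0, 0.0, 4.0]],
]

S = pool_local(B)   # one feature vector per site
G = pool_global(S)  # one feature vector for the whole crystal
```

Each of the two reductions discards the identity of the items being pooled, which is why the captions describe a separate Deep InfoMax objective for each step.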
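
The scoring procedure in the Figure 3 caption can be sketched as follows. The sketch assumes the Jensen-Shannon estimator from the original Deep InfoMax formulation, whose negated per-pair form is softplus(-score_true) + softplus(score_false); the paper's exact loss may differ in detail. The projection matrices `W_z` and `W_c`, their sizes, and the input vectors are hypothetical and randomly initialised purely for illustration.

```python
import math
import random


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def project(weights, vector):
    # Linear map into the shared space: one output value per weight row.
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def jsd_loss(score_true, score_false):
    # Negated Jensen-Shannon lower bound for one (true, false) pair.
    # Minimising it pushes score_true up and score_false down.
    return softplus(-score_true) + softplus(score_false)


rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: representation dim 3, sample dim 2, shared dim 8.
W_z = random_matrix(8, 3)
W_c = random_matrix(8, 2)

z = [0.2, -1.0, 0.5]   # learned representation
c_true = [1.0, 0.3]    # constituent that belongs to z
c_false = [-0.7, 2.0]  # constituent taken from another crystal

h_z = project(W_z, z)
score_true = dot(h_z, project(W_c, c_true))
score_false = dot(h_z, project(W_c, c_false))
loss = jsd_loss(score_true, score_false)
```

With untrained projections the two scores are arbitrary; training would adjust the encoder and projections so that true pairs score high and false pairs score low, which lowers the loss and raises the bound.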
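
The four kinds of false sample listed in the Figure 5 caption can be sketched on a toy data structure. Everything here is an illustrative assumption: the `Crystal` class and `false_samples` function are hypothetical, the donor crystal uses placeholder species and coordinates rather than a real structure, and the donor is assumed to have the same number of sites as the target so that species and positions can be swapped directly.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Crystal:
    species: tuple    # one element symbol per site
    positions: tuple  # one fractional coordinate triple per site


def false_samples(target, donor, rng):
    """Build the four false-sample types (b)-(e) for a target crystal."""
    shuffled = list(target.species)
    # May reproduce the original order by chance ("sample collision").
    rng.shuffle(shuffled)
    return {
        # (b) an unrelated crystal, used as-is
        "unrelated": donor,
        # (c) target stoichiometry on the donor's geometry
        "donated_geometry": Crystal(target.species, donor.positions),
        # (d) donor stoichiometry on the target's geometry
        "donated_stoichiometry": Crystal(donor.species, target.positions),
        # (e) same species and sites, with species reassigned at random
        "shuffled_species": Crystal(tuple(shuffled), target.positions),
    }


# Toy three-site cells; the donor is a placeholder, not a real material.
li2o = Crystal(
    species=("O", "Li", "Li"),
    positions=((0.0, 0.0, 0.0), (0.25, 0.25, 0.25), (0.75, 0.75, 0.75)),
)
donor = Crystal(
    species=("X", "Y", "Y"),
    positions=((0.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.0, 0.5, 0.5)),
)

samples = false_samples(li2o, donor, random.Random(0))
```

The shuffle in (e) returns the true crystal with non-negligible probability on a cell this small, which mirrors the caption's reason for restricting that sample type to the global objective.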