Atom-Density Representations for Machine Learning

Michael J. Willatt, Felix Musil, Michele Ceriotti

TL;DR

The paper tackles the challenge of designing complete yet compact atomistic representations for machine-learning predictions by formulating atomic configurations as density-based kets within a basis-independent Dirac framework. It shows that popular descriptors such as SOAP emerge as specific realizations of invariant kets, and it introduces a unified view in which symmetry averaging and tensor products encode multi-body correlations. The framework also provides operators (Ũ) that couple structure and composition, apply radial scaling, and enable alchemical cross-element features. Key contributions include establishing the connection between Behler-Parrinello, SOAP, and tensorial SOAP within a single formalism, and proposing low-rank and non-factorizable operator schemes to tailor representations for efficiency and accuracy. The approach paves the way for flexible, scalable, and chemically informed density-based ML representations for molecules and materials, with practical guidance for tuning feature dimensionality and prioritizing informative correlations.

Abstract

The applications of machine learning techniques to chemistry and materials science become more numerous by the day. The main challenge is to devise representations of atomic systems that are at the same time complete and concise, so as to reduce the number of reference calculations that are needed to predict the properties of different types of materials reliably. This has led to a proliferation of alternative ways to convert an atomic structure into an input for a machine-learning model. We introduce an abstract definition of chemical environments that is based on a smoothed atomic density, using a bra-ket notation to emphasize basis set independence and to highlight the connections with some popular choices of representations for describing atomic systems. The correlations between the spatial distribution of atoms and their chemical identities are computed as inner products between these feature kets, which can be given an explicit representation in terms of the expansion of the atom density on orthogonal basis functions, which is equivalent to the smooth overlap of atomic positions (SOAP) power spectrum, but also in real space, corresponding to $n$-body correlations of the atom density. This formalism lays the foundations for a more systematic tuning of the behavior of the representations, by introducing operators that represent the correlations between structure, composition, and the target properties. It provides a unifying picture of recent developments in the field and indicates a way forward towards more effective and computationally affordable machine-learning schemes for molecules and materials.

Paper Structure

This paper contains 20 sections, 61 equations, 4 figures.

Figures (4)

  • Figure 1: Atom-density-based structural representations, expressed in the real-space $\bra{\mathbf{r}}$ basis. (a) A structure can be mapped onto a smooth atom density built as a superposition of smooth atom-centered functions. The overall density can be decomposed in atom-centered environments, and information on chemical compositions can be stored by decorating the functions with elemental kets. (b) The $\nu=1$ invariant ket corresponds to spherical averaging of the environmental atom density. (c) The $\nu=2$ invariant ket corresponds to three-body correlations, which are obtained by integrating over all rotations a stencil corresponding to two distances along two directions with a fixed angle $\arccos \omega$ between them.
  • Figure 2: Isocontours of the 3-body correlation functions associated with the environment centered on the tagged carbon atom of an ethanol molecule. From left to right, the figures correspond to $\bra{\text{C}r\text{H}r'\omega}\ket{\mathcal{X}^{(2)}_j}_{\hat{R}}/rr'$, $\bra{\text{O}r\text{H}r'\omega}\ket{\mathcal{X}^{(2)}_j}_{\hat{R}}/rr'$, $\bra{\text{O}r\text{H}r'\omega}\ket{\mathcal{X}^{(2)}_j}_{\hat{R}}/rr'$.
  • Figure 3: Schematic representation of the construction of a real-space representation of a tensorial ket associated with a $\lambda$-SOAP kernel. The (smooth) atom density is evaluated at two points corresponding to a stencil $(r,r',\omega)$, and the spherical harmonic $Y^\lambda_\mu$ is evaluated at the angles $(\theta,\phi)$, relative to the reference frame that is used to describe the stencil.
  • Figure 4: (a) Permutation-variant structural descriptors can be stored in a vector to be used as an atomic-scale representation. (b) Sorting this vector makes it permutationally invariant. (c) It is easy to see how the sorted vector relates to the cumulative distribution function associated with the histogram of the values of the structural features.
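The idea in the Figure 4 caption can be illustrated with a minimal sketch. It assumes, purely for illustration, that the permutation-variant descriptor is the list of distances from a central atom to its neighbors; the function names and the toy coordinates below are hypothetical and are not taken from the paper.

```python
import math

def distance_vector(center, neighbors):
    """Permutation-variant descriptor: one distance per neighbor, in list order."""
    return [math.dist(center, p) for p in neighbors]

def sorted_descriptor(center, neighbors):
    """Sorting the entries removes the dependence on neighbor ordering."""
    return sorted(distance_vector(center, neighbors))

def empirical_cdf(values, x):
    """Fraction of entries that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

# Toy environment (hypothetical coordinates, arbitrary units).
center = (0.0, 0.0, 0.0)
neighbors = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 1.5), (1.0, 1.0, 1.0)]
permuted = neighbors[::-1]  # same atoms, relabeled in reverse order

raw_a = distance_vector(center, neighbors)
raw_b = distance_vector(center, permuted)
sorted_a = sorted_descriptor(center, neighbors)
sorted_b = sorted_descriptor(center, permuted)

# With distinct values, the k-th sorted entry is where the empirical CDF reaches k/N.
cdf_at_sorted = [empirical_cdf(raw_a, v) for v in sorted_a]

print(raw_a == raw_b)        # the raw vectors depend on atom ordering
print(sorted_a == sorted_b)  # the sorted vectors do not
print(cdf_at_sorted)
```

Reading the sorted vector as a table of (value, rank/N) pairs gives the inverse of the empirical cumulative distribution function, which is the relation that panel (c) of the figure describes.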