Optimal Invariant Bases for Atomistic Machine Learning

Alice E. A. Allen, Emily Shinkle, Roxana Bujack, Nicholas Lubbers

TL;DR

The paper tackles the problem of designing complete yet compact invariant descriptors for atomistic machine learning. It develops two complementary approaches: (1) pruning the ACE descriptor set via Langbein-style Jacobian analysis to yield a functionally independent subset, and (2) constructing a flexible, Cartesian-invariant basis that enables neurons to recognize up to 5-body patterns within HIP-HOP-NN. The resulting ACE reductions show improved force accuracy across materials, while HIP-HOP-NN achieves state-of-the-art or near-state-of-the-art performance on methane, QM7, QM9, and COMP6 benchmarks with favorable computational efficiency. The work provides a practical, broadly applicable framework for minimizing descriptor cost while maximizing expressivity in atomistic representations, with potential extensions to higher-order invariants and covariant networks.

Abstract

The representation of atomic configurations for machine learning models has led to the development of numerous descriptors, often to describe the local environment of atoms. However, many of these representations are incomplete and/or functionally dependent. Incomplete descriptor sets are unable to represent all meaningful changes in the atomic environment. Complete constructions of atomic environment descriptors, on the other hand, often suffer from a high degree of functional dependence, where some descriptors can be written as functions of the others. These redundant descriptors provide no additional power to discriminate between different atomic environments, yet they increase the computational burden. By applying techniques from the pattern recognition literature to existing atomistic representations, we remove descriptors that are functions of other descriptors to produce the smallest possible set that satisfies completeness. We apply this in two ways. First, we refine an existing description, the Atomic Cluster Expansion (ACE), and show that this yields a more efficient subset of descriptors. Second, we augment an incomplete construction based on a scalar neural network, yielding a new message-passing network architecture that can recognize up to 5-body patterns in each neuron by taking advantage of an optimal set of Cartesian tensor invariants. This architecture shows strong accuracy on state-of-the-art benchmarks while retaining low computational cost. Our results not only yield improved models, but point the way to classes of invariant bases that minimize cost while maximizing expressivity for a host of applications.
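
The idea of discarding descriptors that are functions of the others can be illustrated with a Jacobian rank test: a descriptor is redundant at a generic point if its gradient lies in the span of the gradients already kept. The sketch below is a minimal illustration under stated assumptions, not the paper's Langbein-based procedure. The toy descriptors (symmetric polynomials `e1`, `e2`, `e3` and the power sum `p2` of three variables), the helper names `matrix_rank` and `select_independent`, the sample point, and the tolerance are all hypothetical choices made for this example.

```python
# Illustrative sketch only: toy descriptors, not the ACE basis or the paper's algorithm.

def matrix_rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [list(map(float, r)) for r in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        if rank == len(m):
            break
        # Choose the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, len(m)), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, len(m)):
            factor = m[i][col] / m[rank][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank


# Analytic gradients of four toy descriptors of (x, y, z).
# p2 = x^2 + y^2 + z^2 equals e1^2 - 2*e2, so it is functionally dependent.
GRADIENTS = {
    "e1": lambda x, y, z: (1.0, 1.0, 1.0),            # e1 = x + y + z
    "e2": lambda x, y, z: (y + z, x + z, x + y),      # e2 = xy + yz + zx
    "p2": lambda x, y, z: (2 * x, 2 * y, 2 * z),      # p2 = x^2 + y^2 + z^2
    "e3": lambda x, y, z: (y * z, x * z, x * y),      # e3 = xyz
}


def select_independent(gradients, point):
    """Greedily keep descriptors whose gradient raises the Jacobian rank."""
    kept, rows = [], []
    for name, grad in gradients.items():
        candidate = rows + [grad(*point)]
        if matrix_rank(candidate) > len(rows):
            kept.append(name)
            rows = candidate
    return kept


# A generic sample point (assumed); in practice several points would be checked.
POINT = (0.3, -1.1, 0.7)
print(select_independent(GRADIENTS, POINT))
```

With this ordering, `p2` is dropped because its gradient is a linear combination of the gradients of `e1` and `e2`, while `e3` is kept because it adds a new direction. A rank test at a single point is only a local check, so a more careful version would repeat it at many sampled configurations.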

Paper Structure

This paper contains 28 sections, 33 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Abstract demonstration of the need for flexible basis sets. Left: a set of two descriptors visualized as iso-contours over the 2-D plane. Right: the gradients of the descriptors, as well as the condition number of the Jacobian. Because of the degeneracy of the Jacobian on the 1D manifold, each combination of descriptor values appears twice, and thus the corresponding functions cannot be differentiated by these two descriptors.
  • Figure 2: The change in test set force error with the size of the basis set for the six materials present in Ref. Zuo2020. A comparison of the full basis set and the basis set reduced with the Langbein (LB) algorithm is shown. The dashed lines show the lowest error achievable with each basis. For a given number of basis elements, the LB basis set provides consistently better accuracy.
  • Figure 3: The reduction in the basis-set size with the Langbein algorithm for varying polynomial degrees and $w_L$ values. The degree of basis-set reduction is consistent and approximately follows a power law with a scaling exponent of 0.75.
  • Figure 4: Model error versus training set size for different models trained on single-molecule methane configurations. For small dataset sizes, different HIP-NN architecture variants produce similar performance. As more and more data becomes available, HIP-HOP-NN is able to learn far more detail about geometries in the environment, significantly surpassing HIP-NN-TS and HIP-NN.
  • Figure 5: The change in test set energy error with the size of the basis set for the six materials present in Ref. Zuo2020. A comparison of the full basis set and the basis set reduced with the Langbein algorithm is shown. The dashed lines show the lowest error achievable for the full basis set and Langbein subset.
  • ...and 1 more figure
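
The situation described in the Figure 1 caption, where a Jacobian that degenerates on a 1D manifold lets each combination of descriptor values appear twice, can be reproduced with a small toy example. The sketch below assumes two hypothetical descriptors on the plane, `d1 = x` and `d2 = y**2`; these are chosen for illustration and are not the descriptors plotted in the paper. The function names are likewise assumptions.

```python
import math

# Illustrative sketch only: hypothetical descriptors d1 = x, d2 = y^2.


def descriptors(x, y):
    """Map a point in the plane to its two descriptor values."""
    return (x, y * y)


def jacobian(x, y):
    """Jacobian of (d1, d2) with respect to (x, y)."""
    return ((1.0, 0.0), (0.0, 2.0 * y))


def condition_number(j):
    """Ratio of largest to smallest singular value of a 2x2 matrix."""
    (a, b), (c, d) = j
    frob2 = a * a + b * b + c * c + d * d
    det = a * d - b * c
    # Squared singular values are the roots of s^2 - frob2*s + det^2 = 0.
    disc = math.sqrt(max(0.0, frob2 * frob2 - 4.0 * det * det))
    s_max = math.sqrt((frob2 + disc) / 2.0)
    s_min = math.sqrt(max(0.0, (frob2 - disc) / 2.0))
    return math.inf if s_min == 0.0 else s_max / s_min


# Two distinct points mirrored across the line y = 0 share descriptor values,
# so a target such as f(x, y) = y cannot be written as a function of (d1, d2).
print(descriptors(0.4, 0.5) == descriptors(0.4, -0.5))

# The condition number diverges on the line y = 0, where the Jacobian is singular.
for y in (1.0, 0.1, 0.0):
    print(y, condition_number(jacobian(0.4, y)))
```

Here the line `y = 0` plays the role of the 1D manifold: the Jacobian loses rank there, and points on either side of it collapse onto the same descriptor values.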