Towards Fast, Specialized Machine Learning Force Fields: Distilling Foundation Models via Energy Hessians

Ishan Amin, Sanjeev Raja, Aditi Krishnapriyan

TL;DR

This work introduces Hessian distillation, a knowledge-distillation framework that transfers the general-purpose representations of MLFF foundation models (FMs) to fast, specialized student models by matching energy Hessians. By precomputing teacher Hessians and subsampling Hessian rows during training, the approach delivers up to 20× faster inference while preserving or improving energy and force accuracy, and maintaining energy conservation in MD simulations. The method is demonstrated across three FM-to-student pipelines (MACE-OFF on SPICE, MACE-MP-0 on Materials Project, and JMP on MD22), yielding faster, more stable simulations and improved geometry optimization across diverse chemical spaces. The authors also provide ablations and discuss practical trade-offs, limitations, and a future vision in which foundation models serve as reservoirs for specialized, efficient simulation engines tailored to downstream tasks.

Abstract

The foundation model (FM) paradigm is transforming Machine Learning Force Fields (MLFFs), leveraging general-purpose representations and scalable training to perform a variety of computational chemistry tasks. Although MLFF FMs have begun to close the accuracy gap relative to first-principles methods, there is still a strong need for faster inference speed. Additionally, while research is increasingly focused on general-purpose models that transfer across chemical space, practitioners typically only study a small subset of systems at a given time. This underscores the need for fast, specialized MLFFs relevant to specific downstream applications, which preserve test-time physical soundness while maintaining train-time scalability. In this work, we introduce a method for transferring general-purpose representations from MLFF foundation models to smaller, faster MLFFs specialized to specific regions of chemical space. We formulate our approach as a knowledge distillation procedure, where the smaller "student" MLFF is trained to match the Hessians of the energy predictions of the "teacher" foundation model. Our specialized MLFFs can be up to 20$\times$ faster than the original foundation model, while retaining, and in some cases exceeding, its performance and that of undistilled models. We also show that distilling from a teacher model with a direct force parameterization into a student model trained with conservative forces (i.e., computed as derivatives of the potential energy) successfully leverages the representations from the large-scale teacher for improved accuracy, while maintaining energy conservation during test-time molecular dynamics simulations. More broadly, our work suggests a new paradigm for MLFF development, in which foundation models are released along with smaller, specialized simulation "engines" for common chemical subsets.
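
For orientation, the quantities named in the abstract can be written out as follows. This is a sketch under stated assumptions, not the paper's exact formulation: the first two lines are the standard definitions of conservative forces and the energy Hessian, while the weight $\lambda$, the sampled row set $S$, the squared-norm form of $\mathcal{L}_{KD}$, and the symbols $E_\theta$ (student energy) and $\mathbf{H}_T$ (teacher Hessian) are notation assumed here for illustration.

```latex
% r: atomic positions (3N-dimensional); E_\theta: student energy (assumed notation).
% H^{(i)} denotes row i of a Hessian; S is a randomly sampled set of rows (assumed).
\begin{align}
  \mathbf{F}_\theta(\mathbf{r}) &= -\nabla_{\mathbf{r}} E_\theta(\mathbf{r}) \\
  \mathbf{H}_\theta(\mathbf{r}) &= \nabla_{\mathbf{r}}^{2} E_\theta(\mathbf{r})
      = -\frac{\partial \mathbf{F}_\theta(\mathbf{r})}{\partial \mathbf{r}} \\
  \mathcal{L} &= \mathcal{L}_{EF} + \lambda\, \mathcal{L}_{KD},
  \qquad
  \mathcal{L}_{KD} = \frac{1}{|S|} \sum_{i \in S}
      \left\lVert \mathbf{H}_\theta^{(i)}(\mathbf{r}) - \mathbf{H}_T^{(i)}(\mathbf{r}) \right\rVert^{2}
\end{align}
```

Here $\mathcal{L}_{EF}$ is the usual loss on ground-truth energies and forces, and $\mathcal{L}_{KD}$ is the distillation term that aligns student and teacher Hessian rows.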

Paper Structure (44 sections, 6 equations, 5 figures, 19 tables)

This paper contains 44 sections, 6 equations, 5 figures, 19 tables.

Figures (5)

  • Figure 1: Proposed Hessian distillation schematic. In our proposed distillation approach, we start with a machine learning force field (MLFF) foundation model (FM) that has been trained on a large quantity of diverse data. We precompute energy Hessians of the FM over a specialized data subset. We then train a series of smaller MLFFs on these subsets via our knowledge distillation loss ($\mathcal{L}_{KD}$), which aligns selected rows of the energy Hessian of the smaller (student) models with those of the FM (teacher). We also keep the conventional procedure of training on the ground truth energies and forces ($\mathcal{L}_{EF}$) from the specialized subset. The resulting MLFFs are considerably faster than the FM and can be efficiently used in downstream applications such as MD simulation, geometry optimization, and free energy calculations.
  • Figure 2: Energy Conservation in NVE MD Simulations of Buckyball Catcher. We plot the change in the model-predicted energy over the trajectory for 5 independent initial conditions. Some simulations become unstable before 100 ps (denoted by $\times$). (a) Hessian distillation improves the energy conservation of GemNet-dT models, which exceeds that of JMP-L. (b) Our student GemNet-T models conserve energy because they use conservative forces, while the JMP-L FM energy steadily drifts, broadly suggesting that large-scale models with few built-in constraints can be effectively distilled into smaller, constrained models. (c) Change in energy plotted against test force MAE. Distillation into a GemNet-T student combines the general-purpose representations and accuracy of JMP-L with the physical inductive biases of conservative forces.
  • Figure 3: Parameter count and Hessian subsampling ablations. (a) Force MAE on the Monomers split of SPICE as a function of the GemNet-dT student MLFF simulation speed. The size of the dots indicates the relative number of trainable parameters in each model. Compared to the undistilled model, Hessian distillation improves the speed-accuracy tradeoff. (b) Force MAE on the Solvated Amino Acid split of SPICE as a function of the number of rows of the energy Hessian subsampled at each training iteration. The size of the dots and text indicates the time required per step of training, relative to training without distillation. Reducing to $s=1$ does not have a detrimental effect on model accuracy, and results in more efficient training.
  • Figure 4: Stability of Constant Temperature MD Simulations. Results of constant temperature (NVT) MD simulations using the distilled GemNet-dT and PaiNN student MLFFs. We plot the maximum bond length deviation during NVT simulations of 5 selected systems from the SPICE Solvated Amino Acid split. $\times$ denotes the point at which the simulation becomes unstable. Our distilled models are considerably more stable than their undistilled counterparts, both for (a) GemNet-dT and (b) PaiNN.
  • Figure 5: Geometry optimization with GemNet-dT student MLFFs. (a) Difference in energy of the final, relaxed structure obtained via the distilled and undistilled models. On average, the distilled model converges to lower energy structures. (b) Mean per-atom force norm of the final, relaxed structure obtained via the distilled and undistilled models. On average, the distilled model converges to structures with lower force norms.
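
The procedure described in the Figure 1 and Figure 3 captions (precompute teacher Hessians once, then match a subsample of $s$ rows per training iteration) can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the toy quadratic energy stands in for a real MLFF, the names `energy`, `forces`, `hessian_row`, and `kd_loss` are hypothetical, and Hessian rows are obtained by finite differences of the forces, whereas a real implementation would presumably use automatic differentiation.

```python
import random

def energy(params, x):
    """Toy quadratic energy E(x) = 0.5 * sum_ij k_ij x_i x_j (stand-in for an MLFF)."""
    n = len(x)
    return 0.5 * sum(params[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def forces(params, x):
    """Conservative forces F = -dE/dx, written analytically for the toy energy."""
    n = len(x)
    return [-sum(0.5 * (params[i][j] + params[j][i]) * x[j] for j in range(n))
            for i in range(n)]

def hessian_row(params, x, i, eps=1e-4):
    """Row i of the energy Hessian via central differences: H[i][j] = -dF_j/dx_i."""
    xp, xm = list(x), list(x)
    xp[i] += eps
    xm[i] -= eps
    fp, fm = forces(params, xp), forces(params, xm)
    return [-(a - b) / (2 * eps) for a, b in zip(fp, fm)]

def kd_loss(student, teacher_rows, x, rows):
    """Mean squared error between student and cached teacher Hessian rows."""
    total, count = 0.0, 0
    for i in rows:
        for s, t in zip(hessian_row(student, x, i), teacher_rows[i]):
            total += (s - t) ** 2
            count += 1
    return total / count

# Hypothetical "teacher" and "student" parameters (symmetric coupling matrices).
teacher = [[2.0, 0.5, 0.0], [0.5, 1.5, 0.2], [0.0, 0.2, 1.0]]
student = [[1.8, 0.4, 0.0], [0.4, 1.5, 0.1], [0.0, 0.1, 1.2]]
x = [0.1, -0.2, 0.3]

# Teacher Hessian rows are computed once, before training, and cached.
teacher_rows = {i: hessian_row(teacher, x, i) for i in range(len(x))}

# Subsample s = 1 Hessian row for this "training iteration".
rng = random.Random(0)
rows = rng.sample(range(len(x)), 1)

loss_sub = kd_loss(student, teacher_rows, x, rows)   # student vs. teacher
loss_self = kd_loss(teacher, teacher_rows, x, rows)  # teacher vs. itself
```

In this sketch each sampled row costs two extra force evaluations, so the per-iteration cost of the distillation term grows with the number of sampled rows, which is the trade-off that Figure 3(b) explores.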