Speeding Up MACE: Low-Precision Tricks for Equivarient Force Fields

Alexandre Benoit

TL;DR

This work addresses the high computational cost of SO(3)-equivariant MACE force fields by profiling execution, comparing backends, and evaluating low-precision strategies. It demonstrates that cuEquivariance reduces end-to-end inference latency by about a factor of three, and that casting linear blocks to BF16/FP16 within an FP32 model yields substantial throughput gains while preserving MD observables such as energy and temperature within run-to-run variability. The study also reveals stability trade-offs: FP32_BF16 offers the best accuracy-throughput balance in MD tests, whereas training with half-precision for linear weights can degrade force predictions. Practically, the results advocate defaulting to cuEquivariance with FP32, enabling BF16/FP16 for linear layers (with FP32 accumulations) to maximize throughput, while training remains in FP32; future kernel-level improvements and Ampere/Hopper-era features are expected to unlock additional gains. The work highlights representation-mismatch risks when mixing backends and emphasizes the role of kernel fusion and graph-level optimizations (e.g., FlashTP) in achieving scalable performance gains for equivariant force fields.

Abstract

Machine-learning force fields can deliver accurate molecular dynamics (MD), but at high computational cost. For SO(3)-equivariant models such as MACE, there is little systematic evidence on whether reduced-precision arithmetic and GPU-optimized kernels can cut this cost without harming physical fidelity. This thesis aims to make MACE cheaper and faster while preserving accuracy by identifying computational bottlenecks and evaluating low-precision execution policies. We profile MACE end-to-end and per block, compare the e3nn and NVIDIA cuEquivariance (cuEq) backends, and assess FP64/FP32/BF16/FP16 settings (with FP32 accumulation) for inference, short NVT and long NPT water simulations, and toy training runs under reproducible, steady-state timing. cuEquivariance reduces inference latency by about $3\times$. Casting only linear layers to BF16/FP16 within an FP32 model yields roughly $4\times$ additional speedups, while energies and thermodynamic observables in NVT/NPT MD remain within run-to-run variability. Half-precision weights during training degrade force RMSE. Mixing e3nn and cuEq modules without explicit adapters causes representation mismatches. Fused equivariant kernels and mixed-precision inference can substantially accelerate state-of-the-art force fields with negligible impact on downstream MD. A practical policy is to use cuEquivariance with FP32 by default and enable BF16/FP16 for linear layers (keeping FP32 accumulations) for maximum throughput, while training remains in FP32. Further gains are expected on Ampere/Hopper GPUs (TF32/BF16) and from kernel-level FP16/BF16 paths and pipeline fusion.
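To make the "BF16 linear layer with FP32 accumulation" policy concrete, the following is a minimal, standard-library-only sketch of the numerics. It is an illustration under stated assumptions, not the MACE or cuEquivariance code path: the helper names (`to_bf16`, `to_fp32`, `linear_bf16`) are hypothetical, BF16 is emulated by rounding FP32 bit patterns, and only finite inputs are handled.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate a cast to bfloat16 (round-to-nearest-even, finite inputs only)."""
    # Reinterpret the FP32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of FP32; add a bias so the truncation
    # below rounds to nearest, with ties going to an even last kept bit.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def linear_bf16(weights, x):
    """Hypothetical linear layer: BF16 operands, FP32 accumulator.

    weights: list of rows (out_features x in_features), x: list of inputs.
    """
    w_lp = [[to_bf16(w) for w in row] for row in weights]
    x_lp = [to_bf16(v) for v in x]
    out = []
    for row in w_lp:
        acc = 0.0  # accumulator is kept at FP32 precision
        for w, v in zip(row, x_lp):
            # The product of two BF16 values fits exactly in FP32; only the
            # running sum is rounded, and it is rounded to FP32, not BF16.
            acc = to_fp32(acc + w * v)
        out.append(acc)
    return out


if __name__ == "__main__":
    W = [[1.0, 2.0], [0.5, -1.0]]
    print(linear_bf16(W, [3.0, 4.0]))
    print(to_bf16(0.1))  # shows the rounding error introduced by the cast
```

The design point this sketch illustrates is that precision is lost once, when operands are cast, while the running sum is kept at FP32 so that rounding error does not compound across the reduction.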
