Speeding Up MACE: Low-Precision Tricks for Equivariant Force Fields

Alexandre Benoit

TL;DR

This work addresses the high computational cost of SO(3)-equivariant MACE force fields by profiling execution, comparing backends, and evaluating low-precision strategies. It demonstrates that cuEquivariance reduces end-to-end inference latency by about a factor of three, and that casting linear blocks to BF16/FP16 within an FP32 model yields substantial throughput gains while preserving MD observables such as energy and temperature within run-to-run variability. The study also reveals stability trade-offs: FP32_BF16 offers the best accuracy-throughput balance in MD tests, whereas training with half-precision for linear weights can degrade force predictions. Practically, the results advocate defaulting to cuEquivariance with FP32, enabling BF16/FP16 for linear layers (with FP32 accumulations) to maximize throughput, while training remains in FP32; future kernel-level improvements and Ampere/Hopper-era features are expected to unlock additional gains. The work highlights representation-mismatch risks when mixing backends and emphasizes the role of kernel fusion and graph-level optimizations (e.g., FlashTP) in achieving scalable performance gains for equivariant force fields.

Abstract

Machine-learning force fields can deliver accurate molecular dynamics (MD), but at high computational cost. For SO(3)-equivariant models such as MACE, there is little systematic evidence on whether reduced-precision arithmetic and GPU-optimized kernels can cut this cost without harming physical fidelity. This thesis aims to make MACE cheaper and faster while preserving accuracy by identifying computational bottlenecks and evaluating low-precision execution policies. We profile MACE end-to-end and per block, compare the e3nn and NVIDIA cuEquivariance backends, and assess FP64/FP32/BF16/FP16 settings (with FP32 accumulation) for inference, short NVT and long NPT water simulations, and toy training runs under reproducible, steady-state timing. cuEquivariance reduces inference latency by about $3\times$. Casting only linear layers to BF16/FP16 within an FP32 model yields a further speedup of roughly $4\times$, while energies and thermodynamic observables in NVT/NPT MD remain within run-to-run variability. Half-precision weights during training degrade force RMSE. Mixing e3nn and cuEq modules without explicit adapters causes representation mismatches. Fused equivariant kernels and mixed-precision inference can substantially accelerate state-of-the-art force fields with negligible impact on downstream MD. A practical policy is to use cuEquivariance with FP32 by default and enable BF16/FP16 for linear layers (keeping FP32 accumulations) for maximum throughput, while training remains in FP32. Further gains are expected on Ampere/Hopper GPUs (TF32/BF16) and from kernel-level FP16/BF16 paths and pipeline fusion.
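
To make the "BF16 linear layers with FP32 accumulation" policy concrete, the following is a minimal sketch in plain Python that emulates the numerics only. It is not the thesis's implementation and uses neither MACE, e3nn, nor cuEquivariance. The helper names (`to_fp32`, `to_bf16`, `linear_bf16`, `linear_ref`) and the toy weights are hypothetical, chosen for illustration. The sketch assumes finite inputs and round-to-nearest-even when narrowing to BF16.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite float to BF16 (ties to even), returned as a Python float.

    BF16 keeps the upper 16 bits of the FP32 encoding:
    1 sign bit, 8 exponent bits, 7 mantissa bits.
    The input is first rounded to FP32, then narrowed to BF16.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1                      # last bit that survives
    bits = (bits + 0x7FFF + lsb) & 0xFFFF0000   # round, then drop low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def linear_bf16(weights, x):
    """Emulate y = W x with BF16 operands and an FP32 accumulator."""
    w16 = [[to_bf16(w) for w in row] for row in weights]
    x16 = [to_bf16(v) for v in x]
    out = []
    for row in w16:
        acc = 0.0
        for w, v in zip(row, x16):
            # Operands are BF16 values; the running sum is rounded to FP32.
            acc = to_fp32(acc + w * v)
        out.append(acc)
    return out


def linear_ref(weights, x):
    """FP64 reference for the same matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


# Hypothetical toy layer: 2 outputs, 3 inputs.
W = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
x = [1.0, 2.0, 3.0]

y16 = linear_bf16(W, x)
y64 = linear_ref(W, x)
max_abs_err = max(abs(a - b) for a, b in zip(y16, y64))
print("BF16/FP32-accumulate:", y16)
print("FP64 reference:      ", y64)
print("max abs error:       ", max_abs_err)
```

With only 7 mantissa bits, each BF16 operand carries a relative rounding error of up to $2^{-8}$, so the output error is dominated by operand rounding rather than by the FP32 accumulation. This is consistent with keeping accumulations in FP32 while narrowing only the operands.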

Paper Structure

This paper contains 101 sections, 27 equations, 21 figures, and 20 tables.

Figures (21)

  • Figure 1: MACE schematic [Batatia2022mace].
  • Figure 2: Bit allocations for the floating-point formats used in this work. FP64 offers the highest precision; FP32 is the single-precision baseline. FP16 reduces both exponent and fraction widths; BF16 trades mantissa precision for FP32-like range; TF32 preserves FP32 range with a 10-bit mantissa to enable Tensor-Core acceleration for FP32-coded kernels. The mantissa width governs machine epsilon and hence rounding granularity [Rshravan_2025].
  • Figure 3: Structures used in this work.
  • Figure 4: cuEquivariance yields near-constant $\sim 3\times$ speedup for higher-order blocks. Speedup (e3nn $\rightarrow$ cuEq) vs. batch size for three angular orders. For $\ell\in\{2,3\}$, speedup is $\approx 2.9$--$3.0\times$ with weak dependence on batch size; for $\ell=1$ it is $\sim 1.1\times$. Each point summarizes 100 forward passes under the same setup as Appendix \ref{appendix:setup}.
  • Figure 5: Latency distributions: cuEq is faster and less variable. Boxplots of per-step latency over 100 runs. Medians of $\sim 100$ ms (cuEq) vs. $\sim 300$ ms (e3nn) illustrate the $\approx 3\times$ gap and the narrower dispersion with cuEq.
  • ...and 16 more figures