COMMET: orders-of-magnitude speed-up in finite element method via batch-vectorized neural constitutive updates

Benjamin Alheit, Mathias Peirlinck, Siddhant Kumar

TL;DR

This work targets the computational bottleneck of neural constitutive models in finite element simulations by introducing COMMET, a framework that combines batch-vectorized assembly, compute-graph optimization (CGO) for exact analytical derivatives, and MPI-based parallelism. The proposed approach reorganizes the standard element-wise assembly into batched operations across many quadrature points, enabling SIMD-style acceleration and reduced memory footprints. CGO replaces expensive automatic differentiation with modular, forward-mode derivative calculations, delivering substantial runtime and memory savings. Across material-point tests, FE benchmarks, and a patient-specific heart inflation example, COMMET achieves up to three orders of magnitude speed-ups in constitutive updates and more than two orders in overall simulation time, with strong MPI scaling to thousands of cores. These results establish a practical pathway to deploy high-fidelity NCMs in large-scale computational mechanics and beyond, under an open-source framework that encourages broad adoption and extension.

Abstract

Constitutive evaluations often dominate the computational cost of finite element (FE) simulations whenever material models are complex. Neural constitutive models (NCMs) offer a highly expressive and flexible framework for modeling complex material behavior in solid mechanics. However, their practical adoption in large-scale FE simulations remains limited due to significant computational costs, especially in repeatedly evaluating stress and stiffness. NCMs thus represent an extreme case: their large computational graphs make stress and stiffness evaluations prohibitively expensive, restricting their use to small-scale problems. In this work, we introduce COMMET, an open-source FE framework whose architecture has been redesigned from the ground up to accelerate high-cost constitutive updates. Our framework features a novel assembly algorithm that supports batched and vectorized constitutive evaluations, compute-graph-optimized derivatives that replace automatic differentiation, and distributed-memory parallelism via MPI. These advances dramatically reduce runtime, with speed-ups exceeding three orders of magnitude relative to traditional non-vectorized automatic differentiation-based implementations. While we demonstrate these gains primarily for NCMs, the same principles apply broadly wherever for-loop based assembly or constitutive updates limit performance, establishing a new standard for large-scale, high-fidelity simulations in computational mechanics.
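
The contrast between point-by-point and batch-vectorized constitutive updates can be illustrated with a minimal NumPy sketch. It is not COMMET's API: the compressible neo-Hookean stress stands in for a neural constitutive model, and the function names, material parameters (`mu`, `lam`), and batch size are hypothetical choices made for illustration only.

```python
import numpy as np

def pk1_stress_single(F, mu=1.0, lam=1.0):
    """First Piola-Kirchhoff stress at ONE quadrature point; F has shape (3, 3).
    Assumed model: compressible neo-Hookean (a stand-in for an NCM)."""
    F_inv_T = np.linalg.inv(F).T
    log_J = np.log(np.linalg.det(F))
    return mu * (F - F_inv_T) + lam * log_J * F_inv_T

def pk1_stress_batch(F, mu=1.0, lam=1.0):
    """Same stress for a whole batch of points; F has shape (n, 3, 3)."""
    F_inv_T = np.transpose(np.linalg.inv(F), (0, 2, 1))
    log_J = np.log(np.linalg.det(F))
    return mu * (F - F_inv_T) + lam * log_J[:, None, None] * F_inv_T

def stress_table_batched(F_all, batch_size):
    """Fill the stress table batch by batch instead of point by point."""
    P_all = np.empty_like(F_all)
    for start in range(0, F_all.shape[0], batch_size):
        stop = start + batch_size  # slicing clips at the end of the table
        P_all[start:stop] = pk1_stress_batch(F_all[start:stop])
    return P_all

# Synthetic deformation gradients close to the identity (illustrative data).
rng = np.random.default_rng(0)
F_all = np.eye(3) + 0.05 * rng.standard_normal((1000, 3, 3))

# Traditional approach: one constitutive call per quadrature point.
P_loop = np.stack([pk1_stress_single(F) for F in F_all])

# Batch-vectorized approach: one constitutive call per batch of 64 points.
P_batched = stress_table_batched(F_all, batch_size=64)
```

Both paths compute the same stresses; the batched version only changes how the work is grouped, so each call operates on contiguous arrays that can be processed with vectorized kernels. The batch size is the tuning parameter that trades vectorization efficiency against the working-set size.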

Paper Structure

This paper contains 28 sections, 43 equations, 15 figures, 1 table, 4 algorithms.

Figures (15)

  • Figure 1: High-level architecture of a neural constitutive model (NCM). The hyperelastic strain energy density is formulated as a composition of two functions, $\mathcal{K}$ and $\mathcal{N}$. The kinematic layer $\mathcal{K}$ maps the deformation gradient and structural vectors to a set of invariant kinematic scalars, ensuring objectivity and material symmetry. These scalars then serve as input to the inner network $\mathcal{N}$, typically a neural network architecture designed to satisfy the convexity conditions required for polyconvexity. The inner network outputs the final strain energy density, from which the stress and stiffness needed in finite element simulations are derived.
  • Figure 2: Schematic comparison of constitutive update strategies in finite element assembly: (a) the traditional approach whereby the stress and stiffness are calculated for one quadrature point at a time, (b) the globally vectorized approach where the state variables (i.e. deformation gradient and structural vectors in the case of hyperelasticity) for all quadrature points are collected in tables from which associated stress and stiffness tables are calculated in a single vectorized computation, and (c) the batch-vectorized approach where batches of quadrature points are processed at a time.
  • Figure 3: Schematic computer memory hierarchy with decreasing latency and increasing speed from left to right. The hierarchy consists of main memory (DRAM); CPU cache (SRAM), which is further divided into L3, L2, and L1 cache; and registers. The L1 cache is split into an L1D cache for data and an L1I cache for instructions, whereas the L3 and L2 caches store both data and instructions. Typically, each CPU core has dedicated L1 and L2 caches, while the L3 cache is often shared between multiple cores.
  • Figure 4: Schematic comparison of constitutive update procedures in finite element assembly under MPI-based distributed parallelization. The (a) traditional, (b) globally vectorized, and (c) batch-vectorized algorithms work as in the single-process case shown in Fig. 2, except that each MPI rank is only responsible for assembly on its associated subdomain of the mesh. Accordingly, for the globally vectorized algorithm (b), the table sizes correspond to the subdomain owned by the rank rather than the entire mesh.
  • Figure 5: Effect of batch size on cache efficiency and computational performance of NCMs. Results are shown for various fixed numbers of material points (indicated in the legend). The reported metrics are (a) cache misses, (b) wall time, (c) relative cache misses (non-vectorized divided by vectorized), and (d) relative speed-up (non-vectorized wall time divided by vectorized wall time).
  • ...and 10 more figures
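
The composition described in the Figure 1 caption, an energy $\mathcal{N}(\mathcal{K}(F))$ whose stress follows from derivatives, can be sketched with hand-written chain-rule derivatives in place of automatic differentiation. This is only an illustration of the idea, not the paper's CGO implementation: the two invariants, the single softplus hidden layer, and the random weights `W1`, `b1`, `w2` are all assumptions, and no convexity constraints are enforced.

```python
import numpy as np

# Hypothetical fixed weights of a tiny inner network (illustration only).
rng = np.random.default_rng(1)
W1 = rng.standard_normal((8, 2))
b1 = rng.standard_normal(8)
w2 = rng.standard_normal(8)

def kinematic_layer(F):
    """K: batch of F (n, 3, 3) -> invariants (n, 2): I1 = tr(F^T F), J = det F."""
    I1 = np.einsum("nij,nij->n", F, F)
    J = np.linalg.det(F)
    return np.stack([I1, J], axis=1)

def inner_network(x):
    """N: invariants (n, 2) -> energy density (n,), one softplus hidden layer."""
    z = x @ W1.T + b1
    return np.logaddexp(0.0, z) @ w2  # softplus(z) = log(1 + exp(z))

def inner_network_grad(x):
    """Analytical dN/dx, shape (n, 2); the softplus derivative is the sigmoid."""
    z = x @ W1.T + b1
    sig = 1.0 / (1.0 + np.exp(-z))
    return (sig * w2) @ W1

def energy(F):
    """Strain energy density W = N(K(F)) for a batch of points."""
    return inner_network(kinematic_layer(F))

def pk1_stress(F):
    """P = dW/dF via the chain rule: sum_i (dN/dx_i) (dx_i/dF)."""
    x = kinematic_layer(F)
    dN = inner_network_grad(x)
    F_inv_T = np.transpose(np.linalg.inv(F), (0, 2, 1))
    dI1_dF = 2.0 * F                        # d tr(F^T F) / dF
    dJ_dF = x[:, 1, None, None] * F_inv_T   # d det(F) / dF = J F^{-T}
    return dN[:, 0, None, None] * dI1_dF + dN[:, 1, None, None] * dJ_dF

# Usage: stresses for a small batch of near-identity deformation gradients.
F = np.eye(3) + 0.05 * rng.standard_normal((4, 3, 3))
P = pk1_stress(F)  # shape (4, 3, 3)
```

Because each layer exposes its own derivative, the stress is assembled from small, reusable pieces, and the whole evaluation stays batched over quadrature points. The stiffness would follow the same pattern with second derivatives, which this sketch omits.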