A versatile FEM framework with native GPU scalability via globally-applied AD

Mohit Pundir; Flavio Lorez; David S. Kammer

A versatile FEM framework with native GPU scalability via globally-applied AD

Mohit Pundir, Flavio Lorez, David S. Kammer

TL;DR

This work addresses the long-standing tension between expressive, energy-based finite-element formulations and scalable computation on modern accelerators. It introduces tatva, a monolithic energy-centric FEM framework that differentiates a single global energy functional to obtain residuals and tangents, combining matrix-free Jacobian–vector products with graph-coloring-based sparse differentiation to achieve $\mathcal{O}(N)$ scaling on GPUs. The approach supports a wide range of problems, including multi-point constraints, mixed-dimensional couplings, and data-driven models such as Neural Constitutive Modeling (NCM) and Neural-Operator Element Methods (NOEM), all within a single differentiable pipeline. Performance results show linear scaling up to tens of millions of degrees of freedom with high throughput, significantly outperforming traditional scatter-add assembly in many regimes. The open-source tatva library enables broad adoption for AI-augmented, variational mechanics on contemporary hardware.

Abstract

Energy-based finite-element formulations provide a unified framework for describing complex physical systems in computational mechanics. In these energy-based methods, the governing equations can be obtained directly by considering the derivatives of a single global energy functional. While Automatic Differentiation (AD) can be used to automate the generation of these derivatives, current frameworks face a clear trade-off based primarily on the scale upon which the AD method is applied. Globally applied AD offers high expressivity but cannot currently be scaled to large problems. Locally applied AD scales well through traditional assembly methods, but the variety of physics and couplings that the framework can easily represent is more limited than the global approach. Here, we introduce an energy-centric framework tatva (https://github.com/smec-ethz/tatva) that defines the physics of a problem as a single global functional and applies AD globally to generate residual and tangent operators. By leveraging Jacobian-vector products for matrix-free solvers and coloring-based sparse differentiation for materializing sparse tangent stiffness matrices when needed, our flexible design scales linearly with the problem size on GPUs. We demonstrate that our framework can handle large problems (with millions of degrees of freedom) without memory exhaustion. Additionally, it offers a unified, fully differentiable methodology that can address a wide range of problems, including multi-point constraints, mixed-dimensional coupling, and the incorporation of neural networks, while maintaining high performance and scalability on modern GPU architectures.

A versatile FEM framework with native GPU scalability via globally-applied AD

TL;DR

scaling on GPUs. The approach supports a wide range of problems, including multi-point constraints, mixed-dimensional couplings, and data-driven models such as Neural Constitutive Modeling (NCM) and Neural-Operator Element Methods (NOEM), all within a single differentiable pipeline. Performance results show linear scaling up to tens of millions of degrees of freedom with high throughput, significantly outperforming traditional scatter-add assembly in many regimes. The open-source tatva library enables broad adoption for AI-augmented, variational mechanics on contemporary hardware.

Abstract

Paper Structure (30 sections, 25 equations, 14 figures, 2 algorithms)

This paper contains 30 sections, 25 equations, 14 figures, 2 algorithms.

Introduction
Proposed framework
Matrix-Free Jacobian-Vector Products (JVP)
Sparse differentiation using coloring
Versatility of the proposed framework
Generalization to non-variational systems
Future-proofing with implicit AI coupling
tatva : A monolithic differentiable library
Performance analysis
Scaling and throughput analysis
Comparison with traditional assembly
Examples
Handling elasticity and boundary conditions
Multi-point constraints
Rigid body constraints using condensation
...and 15 more sections

Figures (14)

Figure 1: Schematic of global and local approaches in AD-based finite-element frameworks. The global approach differentiates the total energy to obtain residual and tangent operators, enabling complex problems such as contact between deformable bodies and interface fracture, but with quadratic memory and computational costs due to the construction of a dense tangent stiffness matrix. The local approach applies AD at the element level to obtain local tangent operators, then uses standard scatter-add assembly to form global operators, reducing memory cost but making complex problems harder to model and less expressive.
Figure 2: Schematic of the batched monolithic engine. To mitigate memory bottlenecks on GPUs, the global mesh is partitioned into contiguous element batches. Each batch is processed via SIMD-parallelized kernels that evaluate the local energy density $\psi$ simultaneously at each quadrature point from all elements in that batch and reduce them to a potential energy for that batch. Energy contribution from each batch is further reduced to get the global potential energy $\Psi_h$. The framework then recovers the global residual vector $\boldsymbol{r}$ (via reverse-mode AD) and the stiffness operator $\mathbf{K}$ (via forward-mode AD) directly from this scalar graph, bypassing the legacy element-level matrix assembly entirely.
Figure 3: Schematic view of sparse differentiation via coloring algorithm. (from left-to-right) Foremost, we construct the sparsity pattern of a tangent stiffness matrix using mesh connectivity. The gray boxes represent the non-zero entries, and white represents zero entries. Next, we color the non-zero entries, i.e. the gray boxes, using the distance-2 greedy algorithm. Non-zero entries in each color group are independent of each other and do not share any degrees of freedom. We group the entries color-wise into a compressed form with size $N_\mathrm{dofs} \times c$. Each column represents a basis vector (or composite perturbation vector) with 1 at non-zero entries and 0 in all other rows. Finally, applying a single Jacobian-vector product to each basis vector allows us to construct a compressed Jacobian matrix $\mathbf{J}_\text{comp}$ of size $N_\mathrm{dofs} \times c$ which is then used to reconstruct the sparse tangent stiffness matrix $\mathbf{K}$.
Figure 4: Computational performance and scaling of the energy-centric framework. (left) The log-log shows execution time for a computation with batch size of 50,000 elements. The unit slope for all methods confirms strict $\mathcal{O}(N)$ scaling. (middle) Assembly throughput (Million DoFs/sec) for the same computations, demonstrating optimal GPU utilization across three orders of magnitude. (right) Comparison to a scatter-add assembly approach similar to xue_jax-fem_2023ichihara_naruki-ichiharafeax_2026 for a 2D computation without batching for fair comparison. The throughput for scatter-add assembly is significantly lower, no longer constant, and constantly drops with increasing DoFs, clearly a consequence of $\mathcal{O}(N^2)$ scaling.
Figure 5: Plate with a circular hole of radius $R=0.05$ subjected to horizontal uniaxial tension $t_x=0.01$. (left) Distribution of the radial stress component $\sigma_{rr}$ in the half-domain model with symmetry boundary conditions imposed on the left edge. Dashed lines indicate radial paths corresponding to $\theta=0^\circ$ (loading direction) and $\theta=45^\circ$. (middle/right) Radial stress components $\sigma_{rr}$, $\sigma_{\theta\theta}$, and $\sigma_{r\theta}$ along $\theta=0^\circ$, and $\theta=45^\circ$, respectively, plotted as functions of the radial coordinate $r$. The solid lines represent the results obtained using the framework presented here; these results are compared with Kirsch’s analytical solution for an infinite plate with a circular hole (dashed line).
...and 9 more figures

A versatile FEM framework with native GPU scalability via globally-applied AD

TL;DR

Abstract

A versatile FEM framework with native GPU scalability via globally-applied AD

Authors

TL;DR

Abstract

Table of Contents

Figures (14)