Locality-Aware Automatic Differentiation on the GPU for Mesh-Based Computations

Ahmed H. Mahmoud, Rahul Goel, Jonathan Ragan-Kelley, Justin Solomon

TL;DR

This paper presents a GPU-native automatic differentiation system tailored for mesh-based computations, exploiting locality and sparsity with per-element forward-mode AD executed entirely inside registers and shared memory. By preallocating sparse Hessian and Jacobian structures from mesh topology and assembling them on the GPU using patch-based RXMesh, the approach avoids global computation graphs and CPU–GPU transfers, supporting dynamic sparsity via GPU-only updates and atomic assembly. The method supports both scalar and vector-valued objectives, dynamic interaction terms, and matrix-free Hessian-vector products, enabling efficient Newton-, Gauss-Newton-, and L-BFGS-style optimization across large-scale meshes. Across seven applications—ranging from cloth and ARAP deformation to integrable polyvector fields and large-scale contact simulations—the system achieves significant speedups over PyTorch, JAX, Warp, Dr.JIT, and Thallo, demonstrating that second-order mesh optimization can be practical at scale with locality-aware differentiation.

Abstract

We present a GPU-based system for automatic differentiation (AD) of functions defined on triangle meshes, designed to exploit the locality and sparsity in mesh-based computation. Our system evaluates derivatives using per-element forward-mode AD, confining all computation to registers and shared memory and assembling global gradients, sparse Jacobians, and sparse Hessians directly on the GPU. By avoiding global computation graphs, intermediate buffers, and device-host synchronization, our approach minimizes memory traffic and enables efficient differentiation under both static and dynamically changing sparsity. Our programming model lets users express energy terms over mesh neighborhoods, while our system automatically manages parallel execution, derivative propagation, sparse assembly, and matrix-free operations such as Hessian-vector products. Our system supports both scalar and vector-valued objectives, dynamic interaction-driven sparsity updates, and seamless integration with external GPU sparse linear solvers. We evaluate our system on applications including elastic and cloth simulation, surface parameterization, mesh smoothing, frame field design, ARAP deformation, and spherical manifold optimization. Across these tasks, our system consistently outperforms state-of-the-art differentiation frameworks, including PyTorch, JAX, Warp, Dr.JIT, and Thallo. We demonstrate speedups across a range of solver types, from Newton and Gauss-Newton for nonlinear least squares to L-BFGS and gradient descent, and across different derivative usage modes, including Hessian-vector products as well as full sparse Hessian and Jacobian construction.
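The per-element scheme described above can be illustrated with a minimal sequential sketch. It is not the authors' implementation: the `Dual` class, `spring_energy`, and `assemble_gradient` are hypothetical names, the quartic spring energy and the 2D positions are chosen only to keep the example short, and the Python loop stands in for what would be one GPU thread per element with an atomic add at the scatter step.

```python
class Dual:
    """A value plus a small dense vector of partials w.r.t. one element's local variables."""

    def __init__(self, val, grad):
        self.val = val
        self.grad = grad

    @staticmethod
    def _lift(x, n):
        # Promote a plain number to a Dual with zero derivatives.
        return x if isinstance(x, Dual) else Dual(x, [0.0] * n)

    def __add__(self, other):
        o = Dual._lift(other, len(self.grad))
        return Dual(self.val + o.val, [a + b for a, b in zip(self.grad, o.grad)])

    __radd__ = __add__

    def __sub__(self, other):
        o = Dual._lift(other, len(self.grad))
        return Dual(self.val - o.val, [a - b for a, b in zip(self.grad, o.grad)])

    def __mul__(self, other):
        o = Dual._lift(other, len(self.grad))
        # Product rule applied to every local partial at once.
        return Dual(self.val * o.val,
                    [self.val * b + o.val * a for a, b in zip(self.grad, o.grad)])

    __rmul__ = __mul__


def spring_energy(xi, xj, rest_len):
    """Illustrative local energy: 0.5 * (|xi - xj|^2 - rest_len^2)^2."""
    d2 = 0.0
    for a, b in zip(xi, xj):
        diff = a - b
        d2 = d2 + diff * diff
    e = d2 - rest_len * rest_len
    return 0.5 * e * e


def assemble_gradient(positions, edges, rest_lens):
    """Differentiate each edge locally, then scatter into the global gradient."""
    dim = len(positions[0])
    n_local = 2 * dim  # two endpoints per edge
    grad = [0.0] * (dim * len(positions))
    energy = 0.0
    for (i, j), rest_len in zip(edges, rest_lens):
        # Seed local variables with unit tangents (compact local indexing).
        def seed(v, offset):
            out = []
            for c in range(dim):
                unit = [0.0] * n_local
                unit[offset + c] = 1.0
                out.append(Dual(positions[v][c], unit))
            return out

        f = spring_energy(seed(i, 0), seed(j, dim), rest_len)
        energy += f.val
        # Map local indices to global ones; on a GPU this would be an atomic add.
        for c in range(dim):
            grad[dim * i + c] += f.grad[c]
            grad[dim * j + c] += f.grad[dim + c]
    return energy, grad
```

Because each element carries only `2 * dim` partials, the derivative state stays small and fixed-size, which is the property that would let it live in registers or shared memory on a GPU.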

Paper Structure

This paper contains 39 sections, 11 equations, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: Prior to our system, nonlinear optimization using Newton's method (shown here as the time for a single Newton iteration of a mass–spring cloth simulation on a $100^2$-vertex mesh) was bottlenecked by differentiation. Our system significantly accelerates derivative computation, shifting the bottleneck to the linear solver.
  • Figure 2: Per-element derivative computation and assembly. For a local energy term $f_i$ over triangle $ABC$, we compute derivatives with respect to local variables ($\in \mathbb{R}^3$) using compact indexing. We evaluate gradients and Hessians in registers/shared memory as small dense vectors/matrices via forward-mode AD, then we map them to global indices and accumulate them into the global pre-allocated gradient, sparse Hessian, or sparse Jacobian.
  • Figure 3: A typical network in machine learning workloads exhibits dense connectivity across many layers, resulting in high arithmetic intensity and making reverse-mode AD a suitable choice (left). In contrast, mesh-based problems give rise to shallow, sparsely connected computation graphs (right), where each local function depends only on a small subset of inputs. This structure makes forward-mode AD more efficient for GPU execution.
  • Figure 4: Cloth simulation driven by a mass–spring system with inertial energy, elastic spring potentials, and gravity. The simulation relies on sparse Hessian computation of these energies at each time step.
  • Figure 5: Impact of patch size on differentiation time
  • ...and 8 more figures
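
The Figure 2 caption describes mapping small dense local Hessians to global indices and accumulating them into a pre-allocated sparse matrix. A minimal sketch of that idea follows, under several assumptions: it stores one scalar unknown per vertex (the caption uses variables in $\mathbb{R}^3$), it uses a plain CSR layout, and the function names `build_csr_pattern`, `add_entry`, and `scatter_local_hessian` are hypothetical rather than part of the system's API.

```python
import bisect


def build_csr_pattern(num_vertices, triangles):
    """Preallocate a CSR pattern: one slot per vertex pair that shares a triangle."""
    neighbors = [{v} for v in range(num_vertices)]  # diagonal is always present
    for tri in triangles:
        for a in tri:
            for b in tri:
                neighbors[a].add(b)
    row_ptr, col_idx = [0], []
    for v in range(num_vertices):
        col_idx.extend(sorted(neighbors[v]))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def add_entry(row_ptr, col_idx, values, row, col, x):
    """Accumulate x into the existing slot (row, col); the pattern is never resized."""
    lo, hi = row_ptr[row], row_ptr[row + 1]
    k = bisect.bisect_left(col_idx, col, lo, hi)
    if k == hi or col_idx[k] != col:
        raise KeyError((row, col))
    values[k] += x  # on a GPU this would be an atomic add


def scatter_local_hessian(row_ptr, col_idx, values, tri, local_h):
    """Map a dense 3x3 per-triangle Hessian to global indices and accumulate it."""
    for a in range(3):
        for b in range(3):
            add_entry(row_ptr, col_idx, values, tri[a], tri[b], local_h[a][b])
```

Since the pattern depends only on mesh connectivity, it can be built once and reused across iterations as long as the topology does not change; only `values` is reset and refilled.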