Locality-Aware Automatic Differentiation on the GPU for Mesh-Based Computations
Ahmed H. Mahmoud, Rahul Goel, Jonathan Ragan-Kelley, Justin Solomon
TL;DR
This paper presents a GPU-native automatic differentiation system tailored for mesh-based computations, exploiting locality and sparsity with per-element forward-mode AD executed entirely in registers and shared memory. By preallocating sparse Hessian and Jacobian structures from mesh topology and assembling them on the GPU using the patch-based RXMesh, the approach avoids global computation graphs and CPU–GPU transfers, and it supports dynamic sparsity via GPU-only updates and atomic assembly. The method handles both scalar and vector-valued objectives, dynamic interaction terms, and matrix-free Hessian-vector products, enabling efficient Newton-, Gauss-Newton-, and L-BFGS-style optimization across large-scale meshes. Across seven applications, ranging from cloth and ARAP deformation to integrable polyvector fields and large-scale contact simulations, the system achieves significant speedups over PyTorch, JAX, Warp, Dr.JIT, and Thallo, demonstrating that second-order mesh optimization can be practical at scale with locality-aware differentiation.
Abstract
We present a GPU-based system for automatic differentiation (AD) of functions defined on triangle meshes, designed to exploit the locality and sparsity in mesh-based computation. Our system evaluates derivatives using per-element forward-mode AD, confining all computation to registers and shared memory and assembling global gradients, sparse Jacobians, and sparse Hessians directly on the GPU. By avoiding global computation graphs, intermediate buffers, and device-host synchronization, our approach minimizes memory traffic and enables efficient differentiation under both static and dynamically changing sparsity. Our programming model lets users express energy terms over mesh neighborhoods, while our system automatically manages parallel execution, derivative propagation, sparse assembly, and matrix-free operations such as Hessian-vector products. Our system supports both scalar and vector-valued objectives, dynamic interaction-driven sparsity updates, and seamless integration with external GPU sparse linear solvers. We evaluate our system on applications including elastic and cloth simulation, surface parameterization, mesh smoothing, frame field design, ARAP deformation, and spherical manifold optimization. Across these tasks, our system consistently outperforms state-of-the-art differentiation frameworks, including PyTorch, JAX, Warp, Dr.JIT, and Thallo. We demonstrate speedups across a range of solver types, from Newton and Gauss-Newton for nonlinear least squares to L-BFGS and gradient descent, and across different derivative usage modes, including Hessian-vector products as well as full sparse Hessian and Jacobian construction.
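To make the per-element forward-mode idea concrete, the following is a minimal CPU-only Python sketch, not the system's actual API. The `Dual` class, the spring-like `edge_energy`, and `assemble_gradient` are hypothetical names chosen for illustration, and the energy itself is an assumed toy example. Each edge seeds derivatives only for its own four local variables, evaluates its energy term, and scatters the small local gradient into the global one. On a GPU, the serial loop would presumably run in parallel per element and the scatter would use atomic additions.

```python
import math


class Dual:
    """A value plus a dense gradient over one element's local variables."""

    def __init__(self, val, grad):
        self.val = val
        self.grad = list(grad)

    def _lift(self, other):
        # Promote a plain number to a Dual with zero derivative.
        if isinstance(other, Dual):
            return other
        return Dual(other, [0.0] * len(self.grad))

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, [a + b for a, b in zip(self.grad, o.grad)])

    __radd__ = __add__

    def __sub__(self, other):
        o = self._lift(other)
        return Dual(self.val - o.val, [a - b for a, b in zip(self.grad, o.grad)])

    def __mul__(self, other):
        o = self._lift(other)
        # Product rule: d(uv) = u dv + v du.
        return Dual(self.val * o.val,
                    [self.val * b + o.val * a for a, b in zip(self.grad, o.grad)])

    __rmul__ = __mul__


def dual_sqrt(x):
    """Square root with derivative d sqrt(u) = du / (2 sqrt(u))."""
    r = math.sqrt(x.val)
    return Dual(r, [g / (2.0 * r) for g in x.grad])


def edge_energy(p, q, rest):
    """Assumed toy energy: 0.5 * (|p - q| - rest)^2 for 2D points p and q."""
    dx = p[0] - q[0]
    dy = p[1] - q[1]
    length = dual_sqrt(dx * dx + dy * dy)
    diff = length - rest
    return 0.5 * diff * diff


def assemble_gradient(positions, edges, rest):
    """Evaluate the total energy and its gradient, one edge at a time."""
    energy = 0.0
    grad = [[0.0, 0.0] for _ in positions]
    for i, j in edges:
        # Local variables of this element: (x_i, y_i, x_j, y_j).
        local = [positions[i][0], positions[i][1],
                 positions[j][0], positions[j][1]]
        # Seed each local variable with a unit derivative direction.
        seeds = [Dual(v, [1.0 if k == m else 0.0 for m in range(4)])
                 for k, v in enumerate(local)]
        e = edge_energy(seeds[0:2], seeds[2:4], rest)
        energy += e.val
        # Scatter the 4-entry local gradient into the global gradient.
        grad[i][0] += e.grad[0]
        grad[i][1] += e.grad[1]
        grad[j][0] += e.grad[2]
        grad[j][1] += e.grad[3]
    return energy, grad


if __name__ == "__main__":
    verts = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.5)]
    tri_edges = [(0, 1), (1, 2), (2, 0)]
    total, g = assemble_gradient(verts, tri_edges, rest=1.0)
    print(total, g)
```

Because each element touches only a handful of variables, the derivative storage per element is a small fixed-size array. This is the property that lets such a computation stay in fast on-chip memory rather than in a global computation graph.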
