Covers all aspects of mathematical software development.
We present BioNetFlux, an open-source Python framework for the numerical simulation of coupled systems of partial differential equations (PDEs) on one-dimensional multi-arc networks by the Hybridized Discontinuous Galerkin method. Its design targets biological transport phenomena on graph-like geometries that arise naturally in microfluidic organ-on-chip (OoC) devices, vascular networks, and in-vitro cell-migration assays.
Modern processors deliver higher throughput for lower-precision arithmetic than for higher-precision arithmetic. For matrix multiplication, the Ozaki scheme exploits this performance gap by splitting the inputs into lower-precision components and delegating the computation to optimized lower-precision routines. However, no similar approach exists for the fast Fourier transform (FFT). Here, we propose a method that computes target-precision FFTs using lower-precision FFTs by applying the Ozaki scheme to the cyclic convolution in the Bluestein FFT. The split component convolutions are computed exactly using the number theoretic transform (NTT), an FFT over a finite field, instead of floating-point FFTs, combined with the Chinese remainder theorem. We introduce an upper bound on the number of splits and an NTT-domain accumulation strategy to reduce the NTT call count. As a concrete implementation, we implement a double-precision FFT using 32-bit NTTs and confirm reduced relative error compared with those for FFTs based on FFTW and Triple-Single precision arithmetic, with stable error across FFT lengths, at most 96 NTT calls, or 64 NTT calls with NTT-domain accumulation. On an Intel Xeon Platinum 8468 for lengths $n=2^{10}$-$2^{18}$, the execution time is approximately 107-1315$\times$ that of FFTW's double-precision FFT, with NTTs accounting for approximately 80% of the total time.
Model-Based Iterative Reconstruction (MBIR) is important because direct methods, such as Filtered Back-Projection (FBP) can introduce significant noise and artifacts in sparse-angle tomography, especially for time-evolving samples. Although MBIR produces high-quality reconstructions through prior-informed optimization, its computational cost has traditionally limited its broader adoption. In previous work, we addressed this limitation by expressing the Radon transform and its adjoint using non-uniform fast Fourier transforms (NUFFTs), reducing computational complexity relative to conventional projection-based methods. We further accelerated computation by employing a multi-GPU system for parallel processing. In this work, we further accelerate our Fourier-domain framework, by introducing four main strategies: (1) a reformulation of the MBIR forward and adjoint operators that exploits their multi-level Toeplitz structure for efficient Fourier-domain computation; (2) an improved initialization strategy that uses back-projected data filtered with a standard ramp filter as the starting estimate; (3) a hierarchical multi-resolution reconstruction approach that first solves the problem on coarse grids and progressively transitions to finer grids using Lanczos interpolation; and (4) a distributed-memory implementation using MPI that enables near-linear scaling on large high-performance computing (HPC) systems. Together, these innovations significantly reduce iteration counts, improve parallel efficiency, and make high-quality MBIR reconstruction practical for large-scale tomographic imaging. These advances open the door to near-real-time MBIR for applications such as in situ, in operando, and time-evolving experiments.
2603.27613We present a large-scale computational study combining arbitrary-precision arithmetic, sequence acceleration, and the PSLQ integer relation algorithm to discover exact closed-form expressions for fundamental constants arising in asymptotic analysis. We compute the Stokes multipliers C_M of the one-dimensional anharmonic oscillators H = p^2/2 + x^2/2 + g x^{2M} for M = 2, 3, ..., 11, extracting 17-30 significant digits from up to 1200 perturbation coefficients computed at 300-digit working precision. The computational pipeline consists of three stages: (i) Rayleigh-Schrodinger recursion in the harmonic oscillator basis, (ii) Richardson extrapolation of order 40-100 to accelerate convergence of ratio sequences, and (iii) PSLQ searches over bases of Gamma-function values and algebraic numbers. This pipeline discovers three new exact identities: C_3^2 pi^4 = 32, C_5^4 Gamma(1/4)^4 pi^5 = 2^{12} 3^2, and C_7^6 Gamma(1/3)^9 pi^6 = 2^{20} 3^3, in addition to confirming the known C_2^2 pi^3 = 6. Equally significant is a negative result: exhaustive PSLQ searches at 30-digit precision with coefficient bounds up to 2000 find no closed form for C_4, strongly suggesting the x^8 case introduces a genuinely new transcendental number. A number-theoretic pattern emerges: closed-form existence correlates with Euler's totient function phi(M-1)/2, which counts algebraically independent Gamma-function transcendentals at denominator M-1. We formulate conjectures connecting computational constant recognition to classical number theory, and provide all code and data for full reproducibility.
We present a MATLAB-based framework for two- and three-dimensional fast Fourier transforms on multiple GPUs for large-scale numerical simulations using the pseudo-spectral Fourier method. The software implements two complementary multi-GPU strategies that overcome single-GPU memory limitations and accelerate spectral solvers. This approach is motivated by and applied to phase-field crystal (PFC) models, which are governed by tenth-order partial differential equations, require fine spatial resolution, and are typically formulated in periodic domains. Our resulting numerical framework achieves significant speedups, approximately sixfold for standard PFC simulations and up to sixtyfold for multiphysics extensions, compared to a purely CPU-based implementation running on hundreds of cores.
We consider the problem of computing a QR (or QZ) decomposition of a real, dense, tall and very skinny matrix. That is, the number of columns is tiny compared to the number of rows, rendering most computations completely or partially memory-bandwidth limited. The paper focuses on recent NVIDIA GPGPUs still supporting 64-bit floating-point arithmetic, but the findings carry over to AMD GPUs as well. We discuss two basic algorithms: Methods based on the normal equations (Gram matrix), in particular Cholesky-QR2 and SVQB, and the "tall-skinny QR" (TSQR), based on Householder transformations in a tree-reduction scheme. We propose two primary optimization techniques: Avoiding the write-back of the Q factor ("Q-less QR"), and exploiting fast local memory (shared memory on GPUs). We compare a straight-forward implementation of Gramian-based methods, and a more sophisticated TSQR implementation, in terms of performance achieved, time-to-solution, and implementation complexity. By performance modelling and numerical experiments with our own code and a vendor-optimized library routine, we demonstrate the crucial need for specialized methods and implementations in this memory-bound to transitional (memory/compute-bound) regime, and that TSQR is competitive in terms of time-to-solution, but at the cost of an investment in low-level code optimization.
Differentiable programming has emerged as a structural prerequisite for gradient-based inverse problems and end-to-end hybrid physics--machine learning in computational fluid dynamics. However, existing differentiable CFD platforms are confined to structured Cartesian grids, excluding the geometrically complex domains where body-conforming unstructured discretizations are indispensable. We present DiFVM, the first GPU-accelerated, end-to-end differentiable finite-volume CFD solver operating natively on unstructured polyhedral meshes. The key enabling insight is a structural isomorphism between finite-volume discretization and graph neural network message-passing: by reformulating all FVM operators as static scatter/gather primitives on the mesh connectivity graph, DiFVM transforms irregular unstructured connectivity into a first-class GPU data structure. All operations are implemented in JAX/XLA, providing just-in-time compilation, operator fusion, and automatic differentiation through the complete simulation pipeline. Differentiable Windkessel outlet boundary conditions are provided for cardiovascular applications, and DiFVM accepts standard OpenFOAM case directories without modification for seamless adoption in existing workflows. Forward validation across benchmarks spanning canonical flows to patient-specific hemodynamics demonstrates close agreement with OpenFOAM, and end-to-end differentiability is demonstrated through inference of Windkessel parameters from sparse observations. DiFVM bridges the critical gap between differentiable programming and unstructured-mesh CFD, enabling gradient-based inverse problems and physics-integrated machine learning on complex engineering geometries.
Multiple-precision floating-point branch-free algorithms can significantly accelerate multi-component arithmetic implemented by combining hardware-based binary64 and binary32, particularly for triple- and quadruple-precision computations. In this study, we achieved benchmark results on x86 and ARM CPU platforms to quantify the accelerations achieved in linear computations and polynomial evaluation by integrating these algorithms.
Scorio.jl is a Julia package for evaluating and ranking systems from repeated responses to shared tasks. It provides a common tensor-based interface for direct score-based, pairwise, psychometric, voting, graph, and listwise methods, so the same benchmark can be analyzed under multiple ranking assumptions. We describe the package design, position it relative to existing Julia tools, and report pilot experiments on synthetic rank recovery, stability under limited trials, and runtime scaling.
This monograph presents the design, implementation, and evaluation of Pyroclast, a modular high-performance Python framework for large-scale geodynamic simulations. Pyroclast addresses limitations of legacy geodynamics solvers, often implemented in monolithic Fortran, C++, or C codebases with limited GPU support and extensibility, by combining modern numerical methods, hardware-accelerated execution, and a flexible object-oriented architecture. Designed for distributed and GPU-accelerated environments, Pyroclast provides an accessible and efficient platform for simulating mantle convection and lithospheric deformation using the marker-in-cell method and a matrix-free finite difference discretization. The work focuses on a scalable two-dimensional viscous mechanical solver that forms the computational core for future visco-elasto-plastic models. The solver includes a stress-conservative staggered grid discretization of the incompressible Stokes equations, a matrix-free geometric multigrid solver, Krylov and quasi-Newton methods, and MPI-based domain decomposition for distributed execution. Benchmarks evaluate performance and scalability. Shared-memory tests show strong scaling of the Stokes solver and demonstrate a 5-10x speedup on NVIDIA A100 GPUs compared to a multi-core CPU baseline. Distributed advection benchmarks show near-ideal weak scaling up to 896 CPU cores across seven compute nodes. These results demonstrate that Pyroclast achieves high performance while remaining accessible through a high-level Python interface. The framework also provides a blueprint for modernizing legacy geodynamics codes. Its modular architecture and Python-native implementation lower the barrier to entry while enabling interoperability with modern machine learning libraries, enabling hybrid physics-based and data-driven workflows.
We present a JAX implementation of the Self-Scaled Broyden family of quasi-Newton methods, fully compatible with JAX and building on the Optimistix~\cite{rader_optimistix_2024} optimisation library. The implementation includes BFGS, DFP, Broyden and their Self-Scaled variants(SSBFGS, SSDFP, SSBroyden), together with a Zoom line search satisfying the strong Wolfe conditions. This is a short technical note, not a research paper, as it does not claim any novel contribution; its purpose is to document the implementation and ease the adoption of these optimisers within the JAX community. The code is available at https://github.com/IvanBioli/ssbroyden_optimistix.git.
A \emph{tensor-relational} computation is a relational computation where individual tuples carry vectors, matrices, or higher-dimensional arrays. An advantage of tensor-relational computation is that the overall computation can be executed on top of a relational system, inheriting the system's ability to automatically handle very large inputs with high levels of sparsity while high-performance kernels (such as optimized matrix-matrix multiplication codes) can be used to perform most of the underlying mathematical operations. In this paper, we introduce upper-case-lower-case \texttt{EinSum}, which is a tensor-relational version of the classical Einstein Summation Notation. We study how to automatically rewrite a computation in Einstein Notation into upper-case-lower-case \texttt{EinSum} so that computationally intensive components are executed using efficient numerical kernels, while sparsity is managed relationally.
We present a fully device-resident, multi-GPU architecture for the large-scale computational verification of Goldbach's conjecture. In prior work, a segmented double-sieve eliminated monolithic VRAM bottlenecks but remained constrained by host-side sieve construction and PCIe transfer latency. In this work, we migrate the entire segment generation pipeline to the GPU using highly optimised L1 shared-memory tiling, achieving near-zero host-device communication during the critical verification path. To fully leverage heterogeneous multi-GPU clusters, we introduce an asynchronous, lock-free work-stealing pool that replaces static workload partitioning with atomic segment claiming, enabling $99.7$% parallel efficiency at 2 GPUs and $98.6$% at $4$ GPUs. We further implement strict mathematical overflow guards guaranteeing the soundness of the 64-bit verification pipeline up to its theoretical ceiling of $1.84 \times 10^{19}$. On the same hardware, the new architecture achieves a $45.6\times$ algorithmic speedup over its host-coupled predecessor at N = $10^{10}$. End-to-end, the framework verifies Goldbach's conjecture up to $10^{12}$ in $36.5$ seconds on a single NVIDIA RTX 5090, and up to $10^{13}$ in $133.5$ seconds on a four-GPU system. All code is open-source and reproducible on commodity hardware.
Modern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions, including tensor cores for matrix multiplication and hardware-optimized copy operations for multi-dimensional data. These instructions prescribe fixed, often complex data layouts that must be correctly propagated through the entire execution pipeline to ensure both correctness and optimal performance. We present CuTe, a novel mathematical specification for representing and manipulating tensors. CuTe introduces two key innovations: (1) a hierarchical layout representation that directly extends traditional flat-shape and flat-stride tensor representations, enabling the representation of complex mappings required by modern hardware instructions, and (2) a rich algebra of layout operations -- including concatenation, coalescence, composition, complementation, division, tiling, and inversion -- that enables sophisticated layout manipulation, derivation, verification, and static analysis. CuTe layouts provide a framework for managing both data layouts and thread arrangements in GPU kernels, while the layout algebra enables powerful compile-time reasoning about layout properties and the expression of generic tensor transformations. In this work, we demonstrate that CuTe's abstractions significantly aid software development compared to traditional approaches, promote compile-time verification of architecturally prescribed layouts, facilitate the implementation of algorithmic primitives that generalize to a wide range of applications, and enable the concise expression of tiling and partitioning patterns required by modern specialized tensor instructions. CuTe has been successfully deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL.
We present GoldbachGPU, an open-source framework for large-scale computational verification of Goldbach's conjecture using commodity GPU hardware. Prior GPU-based approaches reported a hard memory ceiling near 10^11 due to monolithic prime-table allocation. We show that this limitation is architectural rather than fundamental: a dense bit-packed prime representation provides a 16x reduction in memory footprint, and a segmented double-sieve design removes the VRAM ceiling entirely. By inverting the verification loop and combining a GPU fast-path with a multi-phase primality oracle, the framework achieves exhaustive verification up to 10^12 on a single NVIDIA RTX 3070 (8 GB VRAM), with no counterexamples found. Each segment requires 14 MB of VRAM, yielding O(N) wall-clock time and O(1) memory in N. A rigorous CPU fallback guarantees mathematical completeness, though it was never invoked in practice. An arbitrary-precision checker using GMP and OpenMP extends single-number verification to 10^10000 via a synchronised batch-search strategy. The segmented architecture also exhibits clean multi-GPU scaling on data-centre hardware (tested on 8 x H100). All code is open-source, documented, and reproducible on both commodity and high-end hardware.
Hybrid finite element methods such as hybridizable discontinuous Galerkin, hybrid high-order and weak Galerkin have emerged as powerful techniques for solving partial differential equations on general polytopal meshes. Despite their diverse mathematical origins, these methods share a common computational structure involving hybrid discrete spaces, local projection operators and static condensation. This work presents a comprehensive framework for implementing such methods within the Gridap finite element library. We introduce new abstractions for polytopal mesh representation using graph-based structures, broken polynomial spaces on arbitrary mesh entities, patch-based local assembly for cell-wise linear systems, high-level local operator construction and automated static condensation. These abstractions enable concise implementations of hybrid methods while maintaining computational efficiency through Julia's just-in-time compilation and Gridap's lazy evaluation strategies. We demonstrate the framework through implementations of several non-conforming polytopal methods for the Poisson problem, linear elasticity, incompressible Stokes flow and optimal control on polytopal meshes.
Hyper-reduction methods have gained increasing attention for their potential to accelerate reduced order models for nonlinear systems, yet their comparative accuracy and computational efficiency are not well understood. Motivated by this gap, we evaluate a range of hyper-reduction techniques for nonlinear finite element models across benchmark problems of varying complexity, assessing the inevitable tradeoff between accuracy and speedup. More specifically, we consider interpolation methods based on the gappy proper orthogonal decomposition as well as the empirical quadrature procedure (EQP), and apply them to the hyper-reduction of problems in nonlinear diffusion, nonlinear elasticity and Lagrangian hydrodynamics. Our numerical results are generated using the open source libROM, Laghos and MFEM numerical libraries. Our findings reveal that the comparative performance between hyper-reduction methods depends on both the problem and the choice of time integration method. The EQP method generally achieves lower relative errors than interpolation methods and is more efficient in terms of quadrature point usage, resulting in a lower wall time for the nonlinear diffusion and elasticity problems. However, its online computational cost is observed to be relatively high for Lagrangian hydrodynamics problems. Conversely, interpolation methods exhibit greater variability, especially with respect to the use of different time integration methods in the Lagrangian hydrodynamics problems. The presented results underscore the need for problem specific method selection to balance accuracy and efficiency, while also offering useful guidance for future comparisons and refinements of hyper-reduction techniques.
Neural networks are increasingly deployed in safety- and mission-critical pipelines, yet many verification and analysis results are produced outside the programming environment that defines and runs the model. This separation creates a semantic gap between the executed network and the analyzed artifact, so guarantees can hinge on implicit conventions such as operator semantics, tensor layouts, preprocessing, and floating-point corner cases. We introduce TorchLean, a framework in the Lean 4 theorem prover that treats learned models as first-class mathematical objects with a single, precise semantics shared by execution and verification. TorchLean unifies (1) a PyTorch-style verified API with eager and compiled modes that lower to a shared op-tagged SSA/DAG computation-graph IR, (2) explicit Float32 semantics via an executable IEEE-754 binary32 kernel and proof-relevant rounding models, and (3) verification via IBP and CROWN/LiRPA-style bound propagation with certificate checking. We validate TorchLean end-to-end on certified robustness, physics-informed residual bounds for PINNs, and Lyapunov-style neural controller verification, alongside mechanized theoretical results including a universal approximation theorem. These results demonstrate a semantics-first infrastructure for fully formal, end-to-end verification of learning-enabled systems.
We present trainsum, a versatile Python package for doing computations with multidimensional quantics tensor trains: https://github.com/fh-igd-iet/trainsum. Using the Array API standard together with opt_einsum, trainsum allows the effortless approximation of tensors or functions by tensor trains independent of their shape or dimensionality. Once approximated, our package can perform normal arithmetic operations with quantics tensor trains, including addition, Einstein summations and element-wise transformations. It can be therefore used for generic computations with applications in simulation, data compression, machine learning and data analysis.
The upcoming IEEE-P3109 standard for low-precision floating-point arithmetic can become the foundation of future machine learning hardware and software. Unlike the fixed types of IEEE-754, P3109 introduces a parametric framework defined by bitwidth, precision, signedness, and domain. This flexibility results in a vast combinatorial space of formats -- some with as little as one bit of precision -- alongside novel features such as stochastic rounding and saturation arithmetic. These deviations create a unique verification gap that this paper intends to address. This paper presents FLoPS, Formalization in Lean of the P3109 Standard, which is a comprehensive formal model of P3109 in Lean. Our work serves as a rigorous, machine-checked specification that facilitates deep analysis of the standard. We demonstrate the model's utility by verifying foundational properties and analyzing key algorithms within the P3109 context. Specifically, we reveal that FastTwoSum exhibits a novel property of computing exact "overflow error" under saturation using any rounding mode, whereas previously established properties of the ExtractScalar algorithm fail for formats with one bit of precision. This work provides a verified foundation for reasoning about P3109 and enables formal verification of future numerical software. Our Lean development is open source and publicly available.