A Survey of Neural Network Variational Monte Carlo from a Computing Workload Characterization Perspective

Zhengze Xiao, Xuanzhe Ding, Yuyang Lou, Lixue Cheng, Chaojian Li

Abstract

Neural Network Variational Monte Carlo (NNVMC) has emerged as a promising paradigm for solving quantum many-body problems by combining variational Monte Carlo with expressive neural-network wave-function ansätze. Although NNVMC can achieve competitive accuracy with favorable asymptotic scaling, practical deployment remains limited by high runtime and memory cost on modern graphics processing units (GPUs). Compared with language and vision workloads, NNVMC execution is shaped by physics-specific stages, including Markov chain Monte Carlo (MCMC) sampling, wave-function construction, and derivative/Laplacian evaluation, which produce heterogeneous kernel behavior and nontrivial bottlenecks. This paper provides a workload-oriented survey and empirical GPU characterization of four representative ansätze: PauliNet, FermiNet, Psiformer, and Orbformer. Using a unified profiling protocol, we analyze model-level runtime and memory trends and kernel-level behavior through family breakdown, arithmetic intensity, roofline positioning, and hardware utilization counters. The results show that end-to-end performance is often constrained by low-intensity elementwise and data-movement kernels, while the compute/memory balance varies substantially across ansätze and stages. Based on these findings, we discuss algorithm-hardware co-design implications for scalable NNVMC systems, including phase-aware scheduling, memory-centric optimization, and heterogeneous acceleration.
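
The abstract's point about low-intensity elementwise kernels can be illustrated with the standard roofline bound, attainable FLOP/s = min(peak FLOP/s, arithmetic intensity × memory bandwidth). The sketch below is only an illustration: the device figures `PEAK_FLOPS` and `PEAK_BANDWIDTH` are hypothetical round numbers, not measurements of any GPU in the paper, and the byte counts assume fp32 data with ideal reuse.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from device memory."""
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * peak_bandwidth)


# Hypothetical device (assumed round numbers, not a real GPU specification).
PEAK_FLOPS = 20e12       # 20 TFLOP/s
PEAK_BANDWIDTH = 800e9   # 800 GB/s
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOP/byte where the two limits meet

# Elementwise fp32 add over n values: n FLOPs, two reads and one write of 4 bytes each.
n = 1 << 20
ai_add = arithmetic_intensity(n, 12 * n)

# Square fp32 matmul of size m: 2*m^3 FLOPs, three m-by-m matrices moved once (ideal reuse).
m = 1024
ai_matmul = arithmetic_intensity(2 * m**3, 12 * m**2)

bound_add = attainable_flops(ai_add, PEAK_FLOPS, PEAK_BANDWIDTH)
bound_matmul = attainable_flops(ai_matmul, PEAK_FLOPS, PEAK_BANDWIDTH)

print(f"ridge point: {ridge_point:.1f} FLOP/byte")
print(f"elementwise add: AI={ai_add:.3f}, bound={bound_add / 1e9:.1f} GFLOP/s")
print(f"matmul: AI={ai_matmul:.1f}, bound={bound_matmul / 1e12:.1f} TFLOP/s")
```

Under these assumed numbers the elementwise kernel sits far to the left of the ridge point and is capped by bandwidth, while the matrix multiply reaches the compute roof.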

Paper Structure (23 sections, 10 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Overview of the end-to-end NNVMC workflow (top) and representative application domains (bottom). The workflow covers ansatz construction, Markov chain Monte Carlo (MCMC) sampling of electron configurations, wavefunction and local-energy evaluation, and iterative variational optimization.
  • Figure 2: Overview of the execution pipeline of PauliNet and FermiNet in the deepqmc codebase. Stages A-E represent feature construction and embedding, electron-correlation updates through neural blocks, readout projection, Slater-determinant wavefunction assembly, and derivative/Laplacian evaluation for local-energy computation. The bracket below the pipeline indicates that one VMC evaluation includes one forward pass through Stages A-D and then a Stage E replay over all Cartesian directions of all electrons through JVP, summarized in the figure as a cost scaling with $d \times N_{\mathrm{e}}$, where $d$ is the spatial dimensionality and $N_{\mathrm{e}}$ is the number of electrons. The rightmost inset schematically illustrates the JVP-based Laplacian evaluation in Stage E: it traces an input quantity $x$, an intermediate state $h=F(x)$, and an output $y=F'(x)$, while the paired $\nabla(\cdot)$ and $\Delta(\cdot)$ annotations indicate the gradient- and Laplacian-related quantities propagated in this procedure.
  • Figure 3: Overview of the execution pipeline of Psiformer and Orbformer in the oneqmc codebase. The upper panels show the Orbformer-specific modules used in Stage B, namely the orbital generator, nuclei message-passing neural network (MPNN), and electron Transformer, while the lower block diagram contrasts the Orbformer path with the simpler Psiformer path. Specifically, Stage B (Transformer-based electron representation learning) and Stage C (orbital/readout projection) are the main departures from deepqmc; Stages A and D (feature construction and Slater-determinant wavefunction assembly) follow the same high-level structure as in Figure 2, while Stage E uses a Hutchinson-style Laplacian estimator and is omitted here for brevity.
  • Figure 4: Comparison of GPU runtime and memory usage across four molecules, four determinant-based NNVMC ansätze, and three GPUs (A5000, A100, and H200). From top to bottom, the panels report training runtime, inference runtime, training memory, and inference memory. Runtime is plotted on a log scale and memory on a log2 scale. Training runtime denotes one optimization step (forward, backward, and optimizer update), while inference runtime includes MCMC sampling and local-energy evaluation with model-specific Stage E Laplacian implementations. Dashed outlines indicate configurations where Orbformer runs out of memory on the A5000, namely C$_2$H$_6$ and C$_4$H$_4$.
  • Figure 5: Kernel-level runtime breakdown for (a) PauliNet, (b) FermiNet, (c) Psiformer, and (d) Orbformer on an NVIDIA RTX A5000 GPU.
  • ...and 1 more figures
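
The Stage E descriptions in Figures 2 and 3 contrast two ways of obtaining the Laplacian: an exact evaluation that replays one directional pass per Cartesian coordinate of every electron ($d \times N_{\mathrm{e}}$ passes), and a Hutchinson-style stochastic estimate. The following is a minimal sketch of that trade-off under stated assumptions, not the implementation in either codebase: `toy_log_psi` is a hypothetical quadratic stand-in for a wave function, and central finite differences are swapped in for JVP-based automatic differentiation so that the snippet needs only the standard library.

```python
import random


def toy_log_psi(x):
    """Hypothetical smooth stand-in for log|psi|: a coupled quadratic form."""
    n = len(x)
    diag = -0.5 * sum(xi * xi for xi in x)
    cross = 0.1 * sum(x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return diag + cross


def second_directional_derivative(f, x, v, eps=1e-3):
    """Central-difference estimate of v^T H v, with H the Hessian of f at x.

    Assumption: this replaces the autodiff JVP used in practice.
    """
    xp = [xi + eps * vi for xi, vi in zip(x, v)]
    xm = [xi - eps * vi for xi, vi in zip(x, v)]
    return (f(xp) - 2.0 * f(x) + f(xm)) / (eps * eps)


def laplacian_exact(f, x):
    """Sum second derivatives along every coordinate axis: d * N_e probes."""
    n = len(x)
    total = 0.0
    for i in range(n):
        e_i = [1.0 if j == i else 0.0 for j in range(n)]
        total += second_directional_derivative(f, x, e_i)
    return total


def laplacian_hutchinson(f, x, num_probes, rng):
    """Average v^T H v over random Rademacher probes v (entries +1 or -1)."""
    n = len(x)
    total = 0.0
    for _ in range(num_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += second_directional_derivative(f, x, v)
    return total / num_probes


d, n_e = 3, 2  # spatial dimensionality and electron count (toy values)
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(d * n_e)]

exact = laplacian_exact(toy_log_psi, x)
estimate = laplacian_hutchinson(toy_log_psi, x, num_probes=4000, rng=rng)
print(f"exact Laplacian: {exact:.4f}")
print(f"Hutchinson estimate: {estimate:.4f}")
```

The Hutchinson average is unbiased because Rademacher probes satisfy $E[vv^{\top}] = I$, so $E[v^{\top} H v] = \mathrm{tr}(H)$. The exact loop costs a fixed $d \times N_{\mathrm{e}}$ probes, whereas the estimator trades a tunable number of probes against sampling noise from the off-diagonal Hessian entries.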