Table of Contents
Fetching ...

Scalable hybrid quantum Monte Carlo simulation of U(1) gauge field coupled to fermions on GPU

Kexin Feng, Chuang Chen, Zi Yang Meng

TL;DR

This work develops a GPU-accelerated hybrid quantum Monte Carlo approach for simulating a $U(1)$ gauge field coupled to fermions in (2+1)D (non-compact QED$_3$ with $N_f=2$). By introducing a problem-specific preconditioner, matrix-free matrix–vector multiplications, and CUDA Graph optimization, the authors achieve nearly linear scaling $O(N_ au V_s)$, enabling system sizes up to $N_ au \times L^2 = 660 \times 66^2$. They measure fermionic bilinear and current correlators, finding power-law decays with adjoint scaling dimensions $\Delta_{ m adj} \in (1.75,1.95)$ and a conserved-current dimension $\Delta_J=2$, consistent with a conformal $U(1)$ Dirac spin liquid (DSL) in QED$_3$. The computational advances provide a scalable framework for studying DSL stability and transitions to symmetry-breaking phases at large scales and offer deeper connections to field-theoretical predictions and potential experiments.

Abstract

We develop a GPU-accelerated hybrid quantum Monte Carlo (QMC) algorithm to solve the fundamental yet difficult problem of $U(1)$ gauge field coupled to fermions, which gives rise to a $U(1)$ Dirac spin liquid state under the description of (2+1)d quantum electrodynamics QED$_3$. The algorithm renders a good acceptance rate and, more importantly, nearly linear space-time volume scaling in computational complexity $O(N_τ V_s)$, where $N_τ$ is the imaginary time dimension and $V_s$ is spatial volume, which is much more efficient than determinant QMC with scaling behavior of $O(N_τV_s^3)$. Such acceleration is achieved via a collection of technical improvements, including (i) the design of the efficient problem-specific preconditioner, (ii) customized CUDA kernel for matrix-vector multiplication, and (iii) CUDA Graph implementation on the GPU. These advances allow us to simulate the $U(1)$ Dirac spin liquid state with unprecedentedly large system sizes, which is up to $N_τ\times L\times L = 660\times66\times66$, and reveal its novel properties. With these technical improvements, we see the asymptotic convergence in the scaling dimensions of various fermion bilinear operators and the conserved current operator when approaching the thermodynamic limit. The scaling dimensions find good agreement with field-theoretical expectation, which provides supporting evidence for the conformal nature of the $U(1)$ Dirac spin liquid state in the \qed. Our technical advancements open an avenue to study the Dirac spin liquid state and its transition towards symmetry-breaking phases at larger system sizes and with less computational burden.

Scalable hybrid quantum Monte Carlo simulation of U(1) gauge field coupled to fermions on GPU

TL;DR

This work develops a GPU-accelerated hybrid quantum Monte Carlo approach for simulating a gauge field coupled to fermions in (2+1)D (non-compact QED with ). By introducing a problem-specific preconditioner, matrix-free matrix–vector multiplications, and CUDA Graph optimization, the authors achieve nearly linear scaling , enabling system sizes up to . They measure fermionic bilinear and current correlators, finding power-law decays with adjoint scaling dimensions and a conserved-current dimension , consistent with a conformal Dirac spin liquid (DSL) in QED. The computational advances provide a scalable framework for studying DSL stability and transitions to symmetry-breaking phases at large scales and offer deeper connections to field-theoretical predictions and potential experiments.

Abstract

We develop a GPU-accelerated hybrid quantum Monte Carlo (QMC) algorithm to solve the fundamental yet difficult problem of gauge field coupled to fermions, which gives rise to a Dirac spin liquid state under the description of (2+1)d quantum electrodynamics QED. The algorithm renders a good acceptance rate and, more importantly, nearly linear space-time volume scaling in computational complexity , where is the imaginary time dimension and is spatial volume, which is much more efficient than determinant QMC with scaling behavior of . Such acceleration is achieved via a collection of technical improvements, including (i) the design of the efficient problem-specific preconditioner, (ii) customized CUDA kernel for matrix-vector multiplication, and (iii) CUDA Graph implementation on the GPU. These advances allow us to simulate the Dirac spin liquid state with unprecedentedly large system sizes, which is up to , and reveal its novel properties. With these technical improvements, we see the asymptotic convergence in the scaling dimensions of various fermion bilinear operators and the conserved current operator when approaching the thermodynamic limit. The scaling dimensions find good agreement with field-theoretical expectation, which provides supporting evidence for the conformal nature of the Dirac spin liquid state in the \qed. Our technical advancements open an avenue to study the Dirac spin liquid state and its transition towards symmetry-breaking phases at larger system sizes and with less computational burden.

Paper Structure

This paper contains 6 sections, 33 equations, 11 figures.

Figures (11)

  • Figure 1: Lattice model of QED$_3$. Fermions live on the sites of the lattice represented by blue dots, while gauge fields live on the bonds of the lattice represented by yellow dots. Fermion hops between nearest-neighbor sites with a phase $e^{i \phi_{ij}}$. The plot shows two adjacent square lattice layers, which are connected with temporal bonds represented as black dashed line. Gauge fields on these temporal bonds are set to be zero. Within a square lattice layer, the four bond colors represent the four families of bonds, which will be used in the checkerboard decomposition, as shown in Eq. \ref{['eq:eq24']}; within each family, the fermion hopping terms represented by the bonds all mutually commute. In the bottom right cube, the circles on the facets denote the positive directions of the magnetic flux going out of the cube. The magnetic flux through a space-time plaquette $\Box$ is defined as $\theta^{\mu\nu} = \sum_{ij\in \textmd{edge}(\Box)}\phi_{ij}$, where $\mu,\nu \in \{x, y, \tau\}$ denote the plaquette orientation, $ij$ are along the positive direction, and $\phi_{ij} = -\phi_{ji}$.
  • Figure 2: The speedup achieved by customized CUDA kernel and CUDA Graph, and the computational complexity of HQMC.(a) shows the speedup achieved by customized CUDA kernel for different system sizes $L$, with $V_s=L^2$ and $N_\tau = 10L$. The blue axis on the left represents the latency of the Monte Carlo sweep which is average time cost per generated sample. The light blue line represents the baseline performance, where the PyTorch sparse-matrix-vector multiplication is applied to compute $M^\dagger M v$ and $\tilde{O}^{-1}v$. The dark blue line is the latency of the customized CUDA kernel, where the computations $M^\dagger M v$ and $\tilde{O}^{-1}v$ are replaced with optimized CUDA kernel. The red dashed line is the latency ratio, which shows the speedup between the two implementations for each system size. (b) shows the speedup achieved by CUDA Graph, for different system sizes $L$. The light blue line represents the baseline performance with CUDA Graph turned off, while the dark blue line is the latency when CUDA Graph is turned on. The red dashed line is the latency ratio, which shows the speedup between the two implementations for each system size. (c) shows the scaling behavior of the HQMC sample generation latency and HQMC measurement latency using stochastic estimation (SE), as functions of $N_\tau V_s \propto L^3$, where $N_\tau=\beta/\Delta_\tau = 10L$. The latency (second per sample) is defined as the time spent to generate a gauge sample, or the time to compute a fermionic observable from a gauge sample for SE (the SE latency here is tested on spin-spin correlation using $N_\textmd{rv}=40$). The blue and orange line are the linear fitting of five large-size points. They show that the HQMC algorithm has a time complexity of $O((L^3)^{0.9})$, while the stochastic estimation has a time complexity of $O((L^3)^{1.3})$. As a contrast, the DQMC has a time complexity of $O(N_\tau V_s^3) = O(L^7)$, which is plotted as the green dashed line. The starting point of the green dashed line is $0.38$ second per sweep at $L^3 = 10^3$, obtained from DQMC benchmark.
  • Figure 3: The correlation functions for non-compact QED$_3$ with $K=1$ and $J=1.25$.(a) shows the log plot of the spin-spin correlation $C_S^{\uparrow\downarrow}(r, 0)$ as a function of distance $r$, for various cubic lattice sizes shown as $N_\tau \times L^2$ in the legend. Notice that $C_S^{\uparrow\downarrow} = C_S^{\downarrow\uparrow}$, so we only need to measure $C_S^{\uparrow\downarrow}$ here. To avoid the even-odd oscillation in the finite-size data, we only plot the data of the distance $r = \textmd{odd}$ points. The gray square, triangular and diamond dots show the results computed from the DQMC to benchmark the HQMC result of the corresponding lattice sizes. At the large distance and the large system sizes, the data are manually fitted by the black solid line, which displays a power law decay as $1/r^{3.8}$. (b) shows the log plot of the bond-bond correlation (along $\hat{x}$ direction) as a function of distance $r$. The legend is the same as that in (a). At the large distance and the large system sizes, the manually fit line of the bond-bond correlation also displays a power law decay as $1/r^{3.7}$. (c) shows the log plot of the flux-flux correlation as a function of imaginary time $\tau$. At the large distance and the large system sizes, the black solid line shows the manual fitting which displays a power law decay as $\tau^{-4.0}$. The black dashed line shows the $\tau^{-3.0}$ power law behavior for comparison.
  • Figure 4: The convergence of the preconditioned-CG versus size. The semi-log plot of the pCG convergence error $\varepsilon := \frac{||v - OX_n||}{||v||}$ as a function of lattice linear size $L$ for various fixed max number of pCG iterations. In the benchmark, the pCG convergence error are measured and averaged in a MC process, where we set $J=1.25$, $\Delta_\tau=0.1$. The total cubic lattice sizes for each $L$ is $N_\tau V_s = 10L^3$. At large lattice sizes, the convergence error levels off, which indicates that the number of pCG iterations required to converge has a constant complexity.
  • Figure 5: Banded preconditioner matrix and its CUDA implementation. (a) The plot shows the banded sparsity pattern of the preconditioner $\tilde{O}^{-1}$, where the nonzero elements of the first 500 by 500 entries are displayed. The bandwidth is 6. This preconditioner is obtained for a lattice of $N_\tau\times L\times L = 60 \times 6 \times 6$. Two neighboring diagonal lines are separated by a distance of $V_s = L\times L$ which is $36$ in this illustration. (b) Schematic of the matrix-free preconditioner-vector multiplication $u = \tilde{O}^{-1} v$, which is illustrated with a banded preconditioner matrix with bandwidth $1$. A block of threads are assigned to process the rows of the preconditioner matrix sandwiched by the orange dashed line, with one thread processing one row. The chunk of elements in $v$ involved in this computation are denoted by blue dots, where light blue ones are called the halo elements. In our CUDA kernel implementation, the chunk of $v$ represented by the blue dots are copied into shared memory, which utilizes the local caching of the GPU and reduces the communication overhead.
  • ...and 6 more figures