Exploiting Parallelism for Fast Feynman Diagrammatics

John Sturt; Evgeny Kozik

TL;DR

This paper tackles the computational bottleneck of summing high-order Feynman diagrams in Diagrammatic Monte Carlo (DiagMC) by running combinatorial summation (CoS) on GPUs. CoS encodes the sum over diagrams as a directed graph and evaluates it in a factorised, parallelisable form. This reduces the apparent factorial complexity to $\mathcal{O}(n^2 2^n)$ operations and, when mapped onto GPUs via graph flattening and cooperative thread blocks, yields orders-of-magnitude speed-ups over CPU implementations. The authors demonstrate performance gains on GPUs ranging from consumer to server grade, with speed-ups of up to about $10^3$, enabling studies of strong-correlation physics previously out of reach. They discuss limitations such as memory bandwidth and synchronization, and outline future directions including cross-platform accelerator support, mixed precision, and tensor-train representations. Overall, the work lowers the practical barrier to high-order DiagMC and broadens access to nonperturbative quantum many-body phenomena.
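To give a rough feel for the scaling quoted above, the sketch below compares two bare operation-count proxies: $n!$ for a naive term-by-term sum and $n^2 2^n$ for CoS. Both functions and their names are illustrative assumptions. Constant prefactors and the exact diagram count are ignored, so the numbers indicate growth rates only, not measured costs.

```python
from math import factorial


def naive_cost(n: int) -> int:
    """Proxy for a factorial number of terms at order n (prefactor ignored)."""
    return factorial(n)


def cos_cost(n: int) -> int:
    """Proxy for the O(n^2 2^n) CoS operation count (prefactor ignored)."""
    return n * n * 2 ** n


if __name__ == "__main__":
    # Print both proxies and their ratio for a few diagram orders.
    for n in (4, 8, 12, 16):
        ratio = naive_cost(n) / cos_cost(n)
        print(f"n={n:2d}  n!={naive_cost(n):>16d}  n^2*2^n={cos_cost(n):>10d}  ratio={ratio:.3g}")
```

With these bare proxies the factorial count overtakes $n^2 2^n$ between $n=7$ and $n=8$; where the crossover falls in practice depends on the prefactors omitted here.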

Abstract

Diagrammatic expansions are a paradigmatic and powerful tool of quantum many-body theory. Their evaluation to high order, e.g., by the Diagrammatic Monte Carlo technique, can provide unbiased results in strongly correlated and challenging regimes. However, calculating a factorial number of terms to acceptable precision remains very costly even for state-of-the-art methods. We achieve a dramatic acceleration of evaluating Feynman's diagrammatic series by use of specialised hardware architecture within the recently introduced combinatorial summation (CoS) framework. We present how exploiting the massive parallelism and concurrency available from GPUs leads to orders of magnitude improvement in computation time even on consumer-grade hardware. This provides a platform for making probes of novel phenomena of strong correlations much more accessible.

Paper Structure (16 sections, 7 figures)

This paper contains 16 sections, 7 figures.

Introduction
Parallelisation of Feynman diagram summation
Mental model of parallelism
CoS Algorithm
Parallelising CoS
Results
Discussion
Discussion
Acknowledgements
CUDA Introduction
Execution Model
SMs and Occupancy
Synchronicity
Memory Hierarchies
Host-Device Latency
...and 1 more sections

Figures (7)

Figure 1: Illustration of the technique for combinatorial summation (CoS) of the integrands of all connected Feynman diagrams of order $n$ by means of a directed graph (Ref. Kozik2024CombinatorialDiagrams). The terms are constructed from the Green's functions $G_{\alpha \beta}$ and interaction potentials $V_{\alpha \beta}$ between vertices $\alpha, \beta$ with the coordinates in space-imaginary time $\alpha=(\mathbf{r}_\alpha, t_\alpha)$, $\beta=(\mathbf{r}_\beta, t_\beta)$. The top node of the graph has the value $1$; each node accumulates a sum of all contributions from its incoming edges; each edge transfers the value of its origin node to the destination node multiplied by the corresponding Green's function; the bottom node gives the result of the calculation.
Figure 2: Illustration of the graph flattening transformation for parallel processing. LHS: Example of several levels of a digraph with $L$ levels and $N$ nodes. The sum of all diagrams is accumulated into the final node $N$ as the total of the contributions from each path through the graph. RHS: The same graph, now represented as a flattened array of edges. Each blue box represents a level of the graph, the edges of which can now be processed in parallel.
Figure 3: Speed-up on the CPU obtained by transforming the graph to the "flat" representation illustrated in Fig. 2, relative to the original implementation of Ref. Kozik2024CombinatorialDiagrams. The significant acceleration comes purely from an implementation detail: the flat representation has a more favourable memory access pattern.
Figure 4: Illustration of how a window iterates through the flattened representation of the graph. If a level is larger than the window, the window must move within that level. (a) Position of the window during the first step of a graph evaluation. (b) Window position after the first level has been computed, shifted to the start of the second level.
Figure 5: Number of evaluations per second of the sum of all diagram integrands of order $n$ for several types of hardware. The original implementation of Ref. Kozik2024CombinatorialDiagrams, executed on a state-of-the-art CPU, is labelled "Ref. Kozik2024CombinatorialDiagrams CPU". All other data are obtained with the flattened graph of Fig. 2 on the same CPU (labelled "CPU") and on Nvidia GPU cards (RTX3090, L40S and H100), using single-precision arithmetic. Inset: The corresponding number of floating point operations per second (FLOPS) at each diagram order $n$.
...and 2 more figures
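
The evaluation rule in the Figure 1 caption and the level-wise layout in the Figure 2 caption can be sketched together in a few lines. This is a minimal serial illustration, not the authors' implementation: the function name `evaluate_flat_graph`, the `(src, dst, weight)` edge tuples and the toy graph are assumptions made here, and plain floats stand in for the Green's functions and interaction potentials. The sketch also assumes that every edge in a level reads only from nodes completed in earlier levels, which is what makes the edges of one level independent.

```python
def evaluate_flat_graph(n_nodes, levels):
    """Evaluate a layered digraph stored as per-level lists of edges.

    levels: list of levels, each a list of (src, dst, weight) tuples.
    Node 0 is the top node (value 1); node n_nodes - 1 is the bottom node.
    """
    values = [0.0] * n_nodes
    values[0] = 1.0  # the top node of the graph has the value 1
    for level in levels:
        # Each edge carries its origin node's value times the edge weight.
        # These products are mutually independent, so on a GPU they could be
        # computed by separate threads (hypothetical mapping, not shown here).
        contributions = [(dst, values[src] * weight) for src, dst, weight in level]
        # Each node accumulates the contributions of its incoming edges.
        # A parallel version would need atomic adds or a reduction here.
        for dst, contribution in contributions:
            values[dst] += contribution
    return values[n_nodes - 1]  # the bottom node holds the result


# Toy diamond graph: 0 -> {1, 2} -> 3, with made-up edge weights.
toy_levels = [
    [(0, 1, 2.0), (0, 2, 3.0)],  # level 1
    [(1, 3, 5.0), (2, 3, 7.0)],  # level 2
]
result = evaluate_flat_graph(4, toy_levels)
```

For the toy graph the result is the sum over the two paths, $2 \cdot 5 + 3 \cdot 7 = 31$, matching the statement that the final node collects the contribution of every path through the graph.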