Reducing the Computational Cost Scaling of Tensor Network Algorithms via Field-Programmable Gate Array Parallelism

Songtai Lv; Yang Liang; Rui Zhu; Qibin Zheng; Haiyuan Zou

TL;DR

This paper introduces a fine-grained FPGA-based parallel design to reduce the computational cost scaling of tensor-network algorithms, applying it to iTEBD and HOTRG. By implementing quad-tile partitioning and hardware-accelerated tensor contraction and SVD via two-sided Jacobi rotations, the approach achieves near-linear scaling with bond dimension $D_b$ for iTEBD and quadratic scaling for HOTRG, outperforming CPU and GPU implementations. The results show substantial speedups (e.g., up to ~$19.2\times$ for iTEBD and ~$24.7\times$ for HOTRG) and reveal power-law resource usage, supporting the feasibility of large-scale tensor-network acceleration on future FPGA architectures. Overall, this work establishes a principled hardware-accelerated framework that maps tensor-network computations to FPGA circuits, enabling scalable studies of complex quantum many-body systems and bridging tensor-network methods with hardware design.
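
To make the phrase "two-sided Jacobi rotations" concrete, the following is a minimal sketch for a single real $2\times 2$ block: it finds a left angle and a right angle whose rotations diagonalize the block. The function names (`rot`, `matmul2`, `two_sided_jacobi_2x2`) and the closed-form angle choice are illustrative assumptions, not the paper's circuit, which applies such rotations to tiles of a larger matrix.

```python
import math


def rot(angle):
    """2x2 counter-clockwise rotation matrix for the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def matmul2(x, y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def two_sided_jacobi_2x2(m):
    """Return (theta_l, theta_r, d) with d = rot(theta_l) @ m @ rot(theta_r) diagonal.

    Assumption: m is a real 2x2 block. It is split into a rotation-like part
    (e, h) and a reflection-like part (f, g); each part fixes one combination
    of the left and right angles.
    """
    (a, b), (c, d) = m
    e, h = (a + d) / 2.0, (c - b) / 2.0   # rotation-like part
    f, g = (a - d) / 2.0, (c + b) / 2.0   # reflection-like part
    phi1 = math.atan2(h, e)
    phi2 = math.atan2(g, f)
    theta_l = -(phi1 + phi2) / 2.0
    theta_r = (phi2 - phi1) / 2.0
    diag = matmul2(matmul2(rot(theta_l), m), rot(theta_r))
    return theta_l, theta_r, diag


# Example block; the second diagonal entry is negative when det(m) < 0,
# so singular values are the absolute values of the diagonal.
theta_l, theta_r, diag = two_sided_jacobi_2x2([[3.0, 0.0], [4.0, 5.0]])
```

Because each $2\times 2$ block needs only its own entries to determine its angles, many such blocks can in principle be processed at the same time, which is the property a parallel hardware design can exploit.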

Abstract

Improving the computational efficiency of quantum many-body calculations from a hardware perspective remains a critical challenge. Although field-programmable gate arrays (FPGAs) have recently been exploited to improve the computational scaling of algorithms such as Monte Carlo methods, their application to tensor network algorithms is still at an early stage. In this work, we propose a fine-grained parallel tensor network design based on FPGAs to substantially enhance the computational efficiency of two representative tensor network algorithms: the infinite time-evolving block decimation (iTEBD) and the higher-order tensor renormalization group (HOTRG). By employing a quad-tile partitioning strategy to decompose tensor elements and map them onto hardware circuits, our approach effectively translates algorithmic computational complexity into scalable hardware resource utilization, enabling an extremely high degree of parallelism on FPGAs. Compared with conventional CPU-based implementations, our scheme exhibits superior scalability in computation time, reducing the bond-dimension scaling of the computational cost from $O(D_b^3)$ to $O(D_b)$ for iTEBD and from $O(D_b^6)$ to $O(D_b^2)$ for HOTRG. This work provides a theoretical foundation for future hardware implementations of large-scale tensor network computations.
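
The quad-tile idea for tensor contraction can be illustrated in software with a blocked matrix product, following the $4\times 4$ example in the Figure 2 caption: the inputs are cut into $2\times 2$ tiles, all tile products are formed independently, and the partial results are then summed over the shared block index. This is a sequential sketch under stated assumptions (square matrices, size divisible by the tile size, hypothetical helper names); it mirrors the data flow only and says nothing about the actual FPGA circuits.

```python
def split_into_tiles(mat, tile):
    """Partition a square matrix (list of lists) into tile x tile blocks."""
    nb = len(mat) // tile
    return [[[row[c * tile:(c + 1) * tile] for row in mat[r * tile:(r + 1) * tile]]
             for c in range(nb)]
            for r in range(nb)]


def tile_matmul(a, b):
    """Dense product of two small square tiles."""
    t = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(t)) for j in range(t)]
            for i in range(t)]


def tile_add(a, b):
    """Element-wise sum of two tiles."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def blocked_matmul(A, B, tile=2):
    """Return (A @ B, number of intermediate tile products)."""
    At, Bt = split_into_tiles(A, tile), split_into_tiles(B, tile)
    nb = len(At)
    # First layer: every tile product is independent of the others,
    # so on parallel hardware they could all be computed concurrently.
    partial = {(i, j, k): tile_matmul(At[i][k], Bt[k][j])
               for i in range(nb) for j in range(nb) for k in range(nb)}
    # Second layer: accumulate the partial tiles over the shared index k.
    Mt = [[None] * nb for _ in range(nb)]
    for i in range(nb):
        for j in range(nb):
            acc = partial[(i, j, 0)]
            for k in range(1, nb):
                acc = tile_add(acc, partial[(i, j, k)])
            Mt[i][j] = acc
    # Reassemble the output tiles into one full matrix.
    n = nb * tile
    M = [[Mt[r // tile][c // tile][r % tile][c % tile] for c in range(n)]
         for r in range(n)]
    return M, len(partial)


A = [[float(4 * i + j) for j in range(4)] for i in range(4)]
B = [[float((i + 1) * (j + 2)) for j in range(4)] for i in range(4)]
M, n_partial = blocked_matmul(A, B, tile=2)  # n_partial is 8 for this 4x4 case
```

In this sketch the accumulation loop over `k` is the only part whose length grows with the number of tiles per dimension, which matches the caption's remark that the second computing layer is the one whose time grows with tensor size.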

Paper Structure (7 sections, 6 equations, 5 figures)

Introduction
High-parallel design
Algorithm framework
Quad-tile parallelism for tensor contraction and SVD
Hardware detail
Results
Conclusions

Figures (5)

Figure 1: Schematic illustration of the parallel architecture for tensor network computations on FPGAs. The input and output tensors are partitioned into multiple small blocks, each containing a finite number of tensor elements (four are shown as an example, represented by colored squares). The rectangles denote the SRAMs used to store these blocks. Between the input and output, tensor elements of the same color are processed concurrently by the computing resources in the computing layer (represented by trapezoids) in a fixed number of clock cycles. All computing resources associated with each color are driven simultaneously by four pipelined time sequences. Ideally, as the number of data blocks increases, the corresponding memory and computing resources can be expanded proportionally (indicated by the dashed horizontal edges of the trapezoids), while the number of clock cycles required for processing remains unchanged (indicated by the solid slanted edges of the trapezoids).
Figure 2: Schematic illustration of the FPGA logic structure and fabric configuration for tensor contraction, following the same conventions as Fig. 1; gray lines in the compute layers indicate the logical dependencies of specific input/output blocks. The example shown multiplies two $4\times 4$ matrices according to the strategy in Eq. (4). The input tensors $A$ and $B$ and the output tensor $M$ are each partitioned into four $2\times 2$ blocks, which are individually assigned to corresponding SRAMs, e.g., $A_{I_mK_n}$. Pairwise multiplication of the block matrices produces eight intermediate blocks of size $2\times 2$. Summing these intermediate blocks over the $K$ index then yields the output tensor $M$. For larger tensors, parallelism can be achieved by increasing the number of block indices $I_m$, $K_n$, etc. In this case, the computation time of the first computing layer remains unchanged, while the computation time of the second computing layer grows linearly with the dimension of the original tensors.
Figure 3: Schematic of the logical architecture and configuration of the SVD on FPGA, following the same conventions as Fig. 1; gray lines in the compute layers indicate the logical dependencies of specific input/output blocks. As an illustration, we consider the SVD of an $8 \times 8$ Hermitian matrix $M$. Using the quad-tile partitioning scheme, $M$ on the right is divided into four diagonal blocks $M_{I_iI_i}$ and six off-diagonal blocks $M_{I_iI_j}$ ($i\neq j$) as input tiles. The diagonal blocks $M_{I_iI_i}$ are diagonalized via Jacobi rotations (upper compute layer), producing rotation angles $\theta_{I_i}^{l/r}$, which are used to construct the modules $U$ and $V$. Applying the rotations determined by $\theta_{I_i}^{l/r}$ to all input blocks $M_{I_iI_j}$ yields the updated output blocks $M_{I_iI_j}$ (lower compute layer). After systolic data exchanges, these outputs are fed back as new inputs, and the procedure is iterated. For larger matrices, parallelism can be achieved by increasing the range of $i$ in the index $I_i$, while the computation time of both compute layers remains constant, independent of $i$.
Figure 4: Computation time per step as a function of $D_b$ for (a) the iTEBD calculation and (b) the HOTRG calculation of the one-dimensional antiferromagnetic (AF) Heisenberg chain on different platforms. The red solid squares, black hollow squares, orange diamonds, and blue circles represent the computation time for the FPGA in pipelined parallel style (piped), the FPGA in unpipelined parallel style (unpiped), the GPU, and the CPU, respectively. Error bars indicate twice the standard deviation. The solid lines are fits of the computation time to the form $D_b^{x}$ with fitting parameter $x$. In (a), the fitted values of $x$ for the pipelined FPGA, unpipelined FPGA, GPU, and CPU are 1.11, 1.09, 1.14, and 2.94, respectively. The gray pentagon denotes the computation time of the pipelined FPGA in our previous work [Lv2025], which scales as $D_b^{2.88}$. In (b), the corresponding values of $x$ are 2.10, 2.08, 2.89, and 6.04, respectively. The insets show the same data and fits on a log-log scale.
Figure 5: Hardware resource usage as a function of $D_b$ for the iTEBD (a, b, c, d) and HOTRG (e, f, g, h) calculations of the one-dimensional AF Heisenberg chain for the two FPGA styles, plotted on a log-log scale. The red solid squares and black hollow squares represent the hardware resource usage for the FPGA in pipelined parallel style (piped) and in unpipelined parallel style (unpiped), respectively. The red and black solid lines are the corresponding fits to the function $D_b^x$, where $x$ is the fitting parameter. For the iTEBD calculation, the fitted values of $x$ for BRAM, DSP, FF, and LUT in pipelined (unpipelined) style are 3.38 (0.92), 1.91 (1.87), 1.77 (1.91), and 1.53 (1.85), respectively. For the HOTRG calculation, the corresponding values are 3.32 (1.95), 3.04 (2.89), 3.01 (2.95), and 2.93 (2.93).
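
The exponents quoted in Figures 4 and 5 come from fitting measurements to the form $D_b^x$. One common way to estimate such an exponent is an ordinary least-squares line through the data on a log-log scale, sketched below. The fitting procedure actually used in the paper is not stated here, so this approach is an assumption, and the numbers in the example are synthetic values generated inside the snippet, not data from the paper.

```python
import math


def fit_power_law(d_values, t_values):
    """Fit t ~ prefactor * d**x by least squares on (log d, log t).

    Returns (x, prefactor). Assumes all inputs are strictly positive.
    """
    xs = [math.log(d) for d in d_values]
    ys = [math.log(t) for t in t_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # the exponent x
    intercept = mean_y - slope * mean_x    # log of the prefactor
    return slope, math.exp(intercept)


# Synthetic illustration only: times generated from an exact D**2 law.
bond_dims = [4, 8, 16, 32, 64]
synthetic_times = [0.5 * d ** 2 for d in bond_dims]
exponent, prefactor = fit_power_law(bond_dims, synthetic_times)
```

With real measurements the recovered exponent depends on the range of $D_b$ included in the fit, so fitted values such as those in the captions are best read together with the plotted data range.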
