SUperman: Efficient Permanent Computation on GPUs

Deniz Elbek, Fatih Taşyaran, Bora Uçar, Kamer Kaya

TL;DR

SUperman addresses the challenge of exact permanent computation, a #P-complete problem, by designing a GPU-optimized, multi-node software suite that extends Ryser-based approaches with Gray-code optimizations and architecture-aware memory strategies. The framework supports dense and sparse matrices across real and complex domains. It incorporates preprocessing (DM and Forbert–Marx decompositions) and precision-enhancement techniques (quad-precision outer sums and compensated summation), delivers substantial speedups over CPU baselines, and enables record-scale computations (e.g., a $62\times62$ matrix on 192 GPUs in ~1.63 days). Key contributions include coalesced memory access patterns, per-thread register usage for the $x$ array, and flexible OpenMP/MPI-based parallelism, plus accuracy-guided strategies validated on matrices with known permanents. The results indicate practical impact for applications in quantum computing, physics, and combinatorics, offering a reusable GPU/HPC solution for permanents. Future work points to hybrid/extended precision, Python wrappers, and broader ecosystem integration to improve accessibility and applicability.

Abstract

The permanent is a function, defined for a square matrix, with applications in various domains including quantum computing, statistical physics, complexity theory, combinatorics, and graph theory. Its formula is similar to that of the determinant; however, unlike the determinant, its exact computation is #P-complete, i.e., there is no algorithm to compute the permanent in polynomial time unless P=NP. For an $n \times n$ matrix, the fastest known algorithm has a time complexity of $O(2^{n-1}n)$. Although supercomputers have been employed for permanent computation before, there is no prior work and, more importantly, no publicly available software that leverages cutting-edge High-Performance Computing accelerators such as GPUs. In this work, we design, develop, and investigate the performance of SUperman, a complete software suite that can compute matrix permanents on multiple nodes/GPUs on a cluster while handling various matrix types, e.g., real/complex/binary and sparse/dense, with a unique treatment for each type. Running on a single Nvidia A100 GPU, SUperman is up to $86\times$ faster than a state-of-the-art parallel algorithm on 44 Intel Xeon cores running at 2.10GHz. Leveraging 192 GPUs, SUperman computes the permanent of a $62 \times 62$ matrix in 1.63 days, marking the largest reported permanent computation to date.
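
To make the $O(2^{n-1}n)$ bound concrete, the following is a minimal CPU-only sketch of a Ryser-style permanent with Gray-code ordering. It assumes the Nijenhuis–Wilf form of the formula, which the abstract does not name explicitly, and it is not SUperman's GPU implementation. The function names `permanent_ryser_gray` and `permanent_bruteforce` are hypothetical, chosen for illustration. Consecutive Gray codes differ in one bit, so each of the $2^{n-1}$ iterations adds or subtracts a single column from the running array `x` in $O(n)$ work.

```python
from itertools import permutations
from math import prod


def permanent_bruteforce(a):
    """Reference O(n! * n) permanent straight from the definition (for checking only)."""
    n = len(a)
    return sum(prod(a[i][p[i]] for i in range(n)) for p in permutations(range(n)))


def permanent_ryser_gray(a):
    """Ryser-style permanent in Nijenhuis-Wilf form, visiting subsets in Gray-code order.

    Assumed formula (illustrative, not taken from the paper's code):
      perm(A) = (-1)^(n-1) * 2 * sum_{S subset of first n-1 columns}
                (-1)^|S| * prod_i (x_i + sum_{j in S} a[i][j]),
      with x_i = a[i][n-1] - (sum of row i) / 2.
    """
    n = len(a)
    if n == 0:
        return 1
    # Initial x corresponds to the empty column subset.
    x = [row[n - 1] - sum(row) / 2 for row in a]
    total = prod(x)
    gray_prev = 0
    for g in range(1, 1 << (n - 1)):
        gray = g ^ (g >> 1)            # g-th Gray code
        diff = gray ^ gray_prev        # exactly one bit differs from the previous code
        j = diff.bit_length() - 1      # index of the column that changed
        sign_col = 1 if gray & diff else -1  # column j entered (+) or left (-) the subset
        for i in range(n):
            x[i] += sign_col * a[i][j]
        term = prod(x)
        # The subset size has the same parity as g, so (-1)^|S| = (-1)^g.
        total += -term if g & 1 else term
        gray_prev = gray
    # Leading factor (-1)^(n-1) * 2: +2 for odd n, -2 for even n.
    return (4 * (n & 1) - 2) * total
```

For example, `permanent_ryser_gray([[1, 2], [3, 4]])` evaluates $1\cdot4 + 2\cdot3 = 10$ (as a float, because of the division by 2 in `x`). The brute-force routine is included only as a correctness check on small inputs.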

Paper Structure

This paper contains 18 sections, 14 equations, 8 figures, 5 tables, 4 algorithms.

Figures (8)

  • Figure 1: CRS and CCS formats for a $6 \times 6$ matrix with 13 nonzeros. In CRS/CCS, the first element of rptrs/cptrs arrays is 0, and their last element is equal to 13. In CRS, the cids and vals arrays store the nonzeros in row-major order, whereas in CCS, the rids and vals arrays store them in column-major order.
  • Figure 2: Row-major (top) and column-major (bottom) layout storage when the matrix $\mathbf{A}$ is kept in global memory. Ryser requires access to matrix columns, which implies strided access to the matrix elements for row-major storage. This causes many cache misses for CPUs. On the other hand, with the column-major layout, once a CPU thread touches a location, it fetches the adjacent content in the same cache line to its cache. However, the column-major layout is problematic for GPUs. GPU threads act as teams of 32 and always execute their (memory) operations at the same time. When the threads access far-away locations, it causes stalls due to expensive memory access. The impact can be mitigated when the matrix is stored in row-major format.
  • Figure 3: 67 consecutive 32-bit words in GPU shared memory are distributed to the 32 memory banks in a round-robin fashion. The threads inside three warps perform memory access: the first warp requests three different elements from Bank 0, hence, a 3-way bank conflict occurs. All the threads in the second warp request the same item from Bank 3. This is called a broadcast; there is no bank conflict since no serialisation occurs. All of the third warp's accesses are to different banks, and there is no bank conflict for these accesses.
  • Figure 4: The changed bits ($y$-axis) of 4 threads with a chunk size equal to 17. Different colours represent different threads. The numbers above and below are local and global iteration IDs.
  • Figure 5: The changed bits ($y$-axis) of 4 threads with a chunk size equal to 16. Different colours represent different threads. The numbers above and below are local and global iteration IDs.
  • ...and 3 more figures
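
The CRS and CCS layouts described in the Figure 1 caption can be sketched as follows. This is an illustrative Python construction, not code from SUperman. The array names `rptrs`, `cids`, `rids`, `cptrs`, and `vals` follow the caption, while the functions `to_crs` and `to_ccs` and the small $3 \times 3$ example matrix are hypothetical (the caption's $6 \times 6$ matrix is not reproduced in the text).

```python
def to_crs(dense):
    """Build CRS arrays (rptrs, cids, vals) from a dense list-of-rows matrix."""
    rptrs, cids, vals = [0], [], []   # rptrs always starts at 0
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                cids.append(j)        # column id of this nonzero
                vals.append(v)        # nonzeros stored in row-major order
        rptrs.append(len(vals))       # end of this row; last entry equals nnz
    return rptrs, cids, vals


def to_ccs(dense):
    """Build CCS arrays (cptrs, rids, vals); CCS of A is the CRS of A transposed."""
    transposed = [list(col) for col in zip(*dense)]
    cptrs, rids, vals = to_crs(transposed)
    return cptrs, rids, vals


# Hypothetical 3x3 example with 5 nonzeros.
example = [
    [1, 0, 2],
    [0, 3, 0],
    [4, 0, 5],
]
rptrs, cids, crs_vals = to_crs(example)
cptrs, rids, ccs_vals = to_ccs(example)
```

In this example both pointer arrays are `[0, 2, 3, 5]`: the first element is 0 and the last equals the number of nonzeros, mirroring the 0 and 13 of the caption's matrix. The `vals` arrays differ only in ordering, `[1, 2, 3, 4, 5]` for CRS (row-major) and `[1, 4, 3, 2, 5]` for CCS (column-major).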