High-Performance N-Queens Solver on GPU: Iterative DFS with Zero Bank Conflicts
Guangchao Yao, Yali Li
TL;DR
The paper tackles the computational challenge of counting N-Queens solutions by introducing a GPU-optimized, iterative DFS that fits the entire stack in shared memory. It reconstructs the Somers algorithm into an iterative form and employs bank-conflict-free access patterns, plus multiple optimizations (static/dynamic shared memory, last-row optimization, and inline PTX) to maximize throughput. The approach achieves substantial speedups over prior CUDA methods, solving the 27-Queens problem in 28.4 days on 8 RTX 5090 GPUs and projecting feasible timelines for larger instances, while highlighting design trade-offs in memory and partitioning. The work provides a practical blueprint for accelerating DFS-based problems on GPUs and suggests avenues for distributed scaling and applying the techniques to other combinatorial search tasks.
Abstract
Counting the solutions to the N-Queens problem is a classic combinatorial problem with extremely high computational complexity. To date, the number of solutions has been rigorously verified only for N ≤ 26. In 2016, the research team led by Preußer solved the 27-Queens problem using FPGA hardware, which took approximately one year, though the result has not been independently verified. Recent studies on GPU parallel computing suggest that verifying the 27-Queens solution would still require about 17 months, an excessive cost in time and computational resources. To address this challenge, we propose a parallel computing method for the NVIDIA GPU platform, with the following core contributions: (1) an iterative depth-first search (DFS) algorithm for solving the N-Queens problem; (2) complete mapping of the required stack structure to GPU shared memory; (3) avoidance of bank conflicts through carefully designed memory access patterns; and (4) several additional optimization techniques that further improve performance. Under the proposed optimization framework, we verified the 27-Queens problem in just 28.4 days using eight RTX 5090 GPUs, thereby confirming the correctness of Preußer's computational results. Moreover, we have reduced the projected solving time for the next open case, the 28-Queens problem, to approximately 11 months, making its resolution computationally feasible. Compared to state-of-the-art GPU methods, our method achieves over 10x speedup on an identical hardware configuration (8 A100 GPUs) and over 26x speedup when using 8 RTX 5090 GPUs, bringing a fresh perspective to this long-stagnant problem.
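To make contribution (1) concrete, the following is a minimal CPU-side sketch of an iterative, bitmask-based DFS that counts N-Queens solutions using an explicit per-row stack instead of recursion. It is an illustration under stated assumptions, not the authors' kernel: the function name `count_queens_iterative`, the constant `kMaxN`, and the four stack arrays are hypothetical, and no symmetry reduction, task partitioning, or last-row optimization is applied.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical upper bound for this sketch; keeps all masks within 32 bits.
constexpr int kMaxN = 16;

// Counts all N-Queens solutions with an iterative DFS.
// Each stack array holds one 32-bit word per row (search depth).
std::uint64_t count_queens_iterative(int n) {
    assert(n >= 1 && n <= kMaxN);

    const std::uint32_t full = (1u << n) - 1u;  // n low bits set: all columns
    std::uint32_t cols[kMaxN];    // columns occupied by queens above this row
    std::uint32_t diag_l[kMaxN];  // squares attacked along one diagonal
    std::uint32_t diag_r[kMaxN];  // squares attacked along the other diagonal
    std::uint32_t avail[kMaxN];   // candidate columns not yet tried in this row

    int row = 0;
    cols[0] = diag_l[0] = diag_r[0] = 0u;
    avail[0] = full;
    std::uint64_t count = 0;

    while (row >= 0) {
        if (avail[row] == 0u) {  // row exhausted: pop (backtrack)
            --row;
            continue;
        }
        // Take the lowest remaining candidate column and mark it as tried.
        const std::uint32_t bit = avail[row] & (0u - avail[row]);
        avail[row] ^= bit;

        if (row == n - 1) {  // queen placed in the last row: one solution
            ++count;
            continue;
        }
        // Push: derive the attack masks for the next row.
        const std::uint32_t c = cols[row] | bit;
        const std::uint32_t l = ((diag_l[row] | bit) << 1) & full;
        const std::uint32_t r = (diag_r[row] | bit) >> 1;
        ++row;
        cols[row] = c;
        diag_l[row] = l;
        diag_r[row] = r;
        avail[row] = full & ~(c | l | r);
    }
    return count;
}
```

For example, `count_queens_iterative(8)` returns 92, the well-known count for the 8-Queens problem. Because the stack depth is bounded by N and each entry is a single 32-bit word, a stack of this shape is small enough to be a plausible candidate for per-thread storage in GPU shared memory. A common way to avoid bank conflicts in that setting is to interleave the per-thread stacks so that consecutive threads touch consecutive words at the same depth; whether the paper uses exactly this layout is not stated in the abstract.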
