A Lock-Free, Fully GPU-Resident Architecture for the Verification of Goldbach's Conjecture

Isaac Llorente-Saguer

TL;DR

This work migrates the entire segment generation pipeline to the GPU using highly optimised L1 shared-memory tiling, achieving near-zero host-device communication during the critical verification path, and introduces an asynchronous, lock-free work-stealing pool that replaces static workload partitioning with atomic segment claiming.

Abstract

We present a fully device-resident, multi-GPU architecture for the large-scale computational verification of Goldbach's conjecture. In prior work, a segmented double-sieve eliminated monolithic VRAM bottlenecks but remained constrained by host-side sieve construction and PCIe transfer latency. In this work, we migrate the entire segment generation pipeline to the GPU using highly optimised L1 shared-memory tiling, achieving near-zero host-device communication during the critical verification path. To fully leverage heterogeneous multi-GPU clusters, we introduce an asynchronous, lock-free work-stealing pool that replaces static workload partitioning with atomic segment claiming, enabling $99.7\%$ parallel efficiency at $2$ GPUs and $98.6\%$ at $4$ GPUs. We further implement strict mathematical overflow guards guaranteeing the soundness of the 64-bit verification pipeline up to its theoretical ceiling of $1.84 \times 10^{19}$. On the same hardware, the new architecture achieves a $45.6\times$ algorithmic speedup over its host-coupled predecessor at $N = 10^{10}$. End-to-end, the framework verifies Goldbach's conjecture up to $10^{12}$ in $36.5$ seconds on a single NVIDIA RTX 5090, and up to $10^{13}$ in $133.5$ seconds on a four-GPU system. All code is open-source and reproducible on commodity hardware.
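The difference between static partitioning and dynamic segment claiming can be illustrated with a minimal sketch. It is not the paper's implementation: the names `SegmentPool`, `claim`, and `run_workers` are hypothetical, Python threads stand in for the per-GPU host threads, and a lock-guarded counter stands in for the hardware atomic fetch-add, which Python does not expose (so this sketch is not itself lock-free). It shows only the claiming pattern: each worker repeatedly takes the next unclaimed segment index until the pool is drained, so a faster worker naturally processes more segments.

```python
import threading


class SegmentPool:
    """Shared pool of segment indices claimed dynamically by workers."""

    def __init__(self, num_segments):
        self.num_segments = num_segments
        self._next = 0
        # Assumption: this lock stands in for an atomic fetch-add on a
        # shared counter; the real architecture is described as lock-free.
        self._lock = threading.Lock()

    def claim(self):
        """Return the next unclaimed segment index, or None when drained."""
        with self._lock:
            if self._next >= self.num_segments:
                return None
            idx = self._next
            self._next += 1
            return idx


def run_workers(num_segments, num_workers):
    """Run workers that claim segments until none remain.

    Returns one list per worker holding the segment indices it claimed.
    """
    pool = SegmentPool(num_segments)
    claimed = [[] for _ in range(num_workers)]

    def worker(wid):
        while True:
            seg = pool.claim()
            if seg is None:
                break
            # A real worker would sieve and verify the segment here.
            claimed[wid].append(seg)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Calling `run_workers(1000, 4)` hands out every index from 0 to 999 exactly once, with no fixed assignment of segments to workers. How many segments each worker ends up with depends on thread scheduling.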

Paper Structure (32 sections, 1 equation, 4 figures, 3 tables, 1 algorithm)

Introduction
Related Work
CPU-based verification.
GPU-based approaches.
GoldbachGPU v1.
Systems Architecture
GPU-Native Segment Sieving via L1 Shared Memory
Lock-Free Asynchronous Work-Stealing Pool
Phase 2: Optimised CPU Fallback
Correctness Guarantees and Overflow Safety
Sieve arithmetic.
128-bit Miller–Rabin oracle.
Graceful error handling.
Deployment and CLI Interface
Reproducibility Checklist
...and 17 more sections

Figures (4)

Figure 1: High-level decoupled architecture of GoldbachGPU v2.0. In contrast to v1, where the CPU sieved each segment and transferred the resulting bitset via PCIe, v2.0 generates segment bitsets natively in GPU L1 Shared Memory. A device-side reduction step creates a "Zero-Copy Fast Path" that entirely bypasses the host-device bus during normal operation.
Figure 2: Explicit systems architecture and execution pipeline for a single GPU worker in GoldbachGPU v2.0. Host threads (one per GPU) coordinate via an atomic work queue; initialisation performs VRAM allocation and async host$\rightarrow$device copies. The GPU runs device‑side kernels that build segment bitsets in L1; a device reduction decides whether to trigger a PCIe device$\rightarrow$host transfer and CPU Phase‑2 fallback, otherwise a zero‑copy fast path avoids PCIe.
Figure 3: Log-log runtime scaling on a single RTX 5090. All three implementations ran on the same hardware. The CPU baseline (black dashed) and v1 host-coupled implementation (red) both exhibit I/O-dominated scaling. The v2 device-native implementation (blue) achieves $13.2\times$ speedup at $N = 10^9$, growing to $45.6\times$ at $N = 10^{10}$, as the PCIe overhead eliminated by v2 accumulates proportionally with the number of segments.
Figure 4: Parallel efficiency of the lock-free work-stealing pool at $N = 2 \times 10^{12}$ on RTX 5090 hardware (GB202H Blackwell), from the Nsight Systems profiling campaign described in the paper's profiling section. All three configurations use matched $T_1 = 80.865$ s; efficiency is $\eta_k = T_1 / (k\,T_k)$. The monotonic decrease from 99.7% at 2 GPUs to 98.6% at 4 GPUs is consistent with the terminal segment-drain effect scaling as $O(k/S)$, where $S$ is the total number of segments.
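The efficiency definition in the Figure 4 caption, $\eta_k = T_1 / (k\,T_k)$, is simple enough to write down directly. The sketch below is illustrative only: the function names `parallel_efficiency` and `implied_runtime` are hypothetical. The second function inverts the formula to estimate a $k$-GPU runtime from a reported efficiency. Any value obtained that way is derived from rounded percentages and is not a measurement from the paper.

```python
def parallel_efficiency(t1, tk, k):
    """Parallel efficiency eta_k = T_1 / (k * T_k).

    t1: single-GPU runtime in seconds
    tk: runtime on k GPUs in seconds
    k:  number of GPUs (>= 1)
    """
    if k < 1 or t1 <= 0 or tk <= 0:
        raise ValueError("k must be >= 1 and runtimes must be positive")
    return t1 / (k * tk)


def implied_runtime(t1, efficiency, k):
    """Invert the definition: T_k = T_1 / (k * eta_k).

    The result is only as precise as the efficiency passed in.
    """
    if k < 1 or t1 <= 0 or efficiency <= 0:
        raise ValueError("k must be >= 1; t1 and efficiency must be positive")
    return t1 / (k * efficiency)


# T_1 = 80.865 s is the matched single-GPU runtime stated in the caption.
T1 = 80.865
t2_estimate = implied_runtime(T1, 0.997, 2)  # derived estimate, not measured
t4_estimate = implied_runtime(T1, 0.986, 4)  # derived estimate, not measured
```

By construction, perfect scaling ($T_k = T_1 / k$) gives $\eta_k = 1$, and any overhead such as the terminal segment drain pushes $\eta_k$ below 1.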
