Six Times to Spare: LDPC Acceleration on DGX Spark for AI-Native Open RAN

Ryan Barker; Fatemeh Afghah

TL;DR

This work evaluates LDPC decoding for 5G NR on an accelerator-rich DGX Spark platform, comparing a Grace CPU against an integrated GB10 GPU using a Sionna-based NR-like LDPC5G chain. It finds a robust ~6× throughput improvement and substantial latency headroom when offloading LDPC from CPU to GPU, with CPU usage dominated by decoding and GPU power increases modest relative to the platform’s TDP. The study provides a conservative, high-level methodology that avoids hand-tuned kernels, offering lower-bound estimates of accelerator gains and a reusable framework for evaluating LDPC and other PHY kernels on current and next-generation AI-native RAN hardware. These results inform hardware-partitioning decisions in AI-native RAN architectures, suggesting that even modest GPUs can reclaim significant scheduling headroom and enable accelerator-centric PHY designs on Grace/Blackwell platforms, while highlighting pathways and challenges toward higher-end GH200/GB200 configurations and end-to-end digital-twin integration.

Abstract

Low-density parity-check (LDPC) decoding is one of the most computationally intensive kernels in the 5G New Radio (NR) physical layer and must complete within a 0.5 ms transmission time interval while sharing the budget with FFT, channel estimation, demapping, HARQ, and MAC scheduling. Many open and proprietary stacks still execute LDPC on general-purpose CPUs, raising concerns about missed-slot events and limited scalability as bandwidths, modulation orders, and user multiplexing increase. This paper empirically quantifies the benefit of offloading 5G-style LDPC5G decoding from a Grace CPU to the integrated Blackwell GB10 GPU on an NVIDIA DGX Spark platform. Using NVIDIA Sionna PHY/SYS on TensorFlow, we construct an NR-like link-level chain with an LDPC5G encoder/decoder, 16-QAM modulation, and AWGN, and sweep both the number of codewords decoded in parallel and the number of belief-propagation iterations, timing only the decoding phase while logging CPU and GPU utilization and power. Across the sweep we observe an average GPU/CPU throughput speedup of approximately 6×, with per-codeword CPU latency reaching approximately 0.71 ms at 20 iterations (exceeding the 0.5 ms slot), while the GB10 GPU remains within 6 to 24% of the slot for the same workloads. Resource-usage measurements show that CPU-based LDPC decoding often consumes around ten Grace cores, whereas GPU-based decoding adds only approximately 10 to 15 W over GPU idle while leaving most CPU capacity available for higher-layer tasks. Because our implementation relies on high-level Sionna layers rather than hand-tuned CUDA, these results represent conservative lower bounds on achievable accelerator performance and provide a reusable, scriptable methodology for evaluating LDPC and other physical-layer kernels on future Grace/Blackwell and Aerial/ACAR/AODT platforms.
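The slot-budget figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, and the inputs are the rounded values from the abstract rather than raw measurements.

```python
# Slot-budget arithmetic using the rounded figures quoted in the abstract.
# All names here are illustrative; they do not come from the paper's scripts.

SLOT_MS = 0.5  # transmission time interval stated in the abstract


def slot_fraction(latency_ms, slot_ms=SLOT_MS):
    """Return a per-codeword latency expressed as a fraction of the slot."""
    return latency_ms / slot_ms


# Approximate CPU per-codeword latency at 20 iterations (from the abstract).
cpu_latency_ms = 0.71
cpu_fraction = slot_fraction(cpu_latency_ms)

# The GPU band is quoted as 6 to 24% of the slot; convert it back to ms.
gpu_latency_ms = (0.06 * SLOT_MS, 0.24 * SLOT_MS)

print(f"CPU: {cpu_fraction:.0%} of the slot")
print(f"GPU: {gpu_latency_ms[0]:.2f} to {gpu_latency_ms[1]:.2f} ms per codeword")
```

With these inputs the CPU figure comes to roughly 142% of the slot, while the GPU band corresponds to about 0.03 to 0.12 ms per codeword, which is the headroom the title refers to.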

Paper Structure (12 sections, 2 equations, 2 figures, 1 table)

LDPC, 5G RAN, and Computational Ceilings
Jumpstarting DGX Spark: Experimental Setup and Measurement Methodology
Empirical Evaluation of LDPC5G Decoding on Grace–Blackwell (GB10)
Throughput Versus Decoder Iterations
Per-Codeword Latency Versus NR Slot Time
Speedup Across the Full Sweep
Resource Usage: CPU vs GPU
End-to-End Decode Time
Implications for Hardware Partitioning in AI-RAN Architectures
LDPC Acceleration and Digital-Twin Toolchains for 5G/6G RANs
Summary of Findings and Architectural Lessons for Accelerator-Centric PHY
Toward Sionna-RT and Aerial ACAR/AODT on Grace-Hopper/Grace–Blackwell and GH200/GB200

Figures (2)

Figure 1: LDPC5G throughput versus decoder iterations on DGX Spark. Each point shows the mean throughput over all batch sizes for a fixed number of iterations; text labels indicate GPU/CPU throughput speedup.
Figure 2: Resource usage during LDPC5G decoding on DGX Spark. Left: histogram of approximate Grace cores used by the LDPC Python process. Right: histogram of GB10 GPU utilization for active samples (utilization > 5%), as reported by nvidia-smi.
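The quantities named in the two captions can be derived from raw sweep logs along the following lines. This is a minimal sketch under stated assumptions, not the authors' analysis script: the record layout and helper names are hypothetical, the speedup is taken as a ratio of per-iteration means (the captions do not say whether a ratio of means or a mean of ratios was used), the core estimate assumes a top-style process CPU percentage in which 100% equals one core, and the sample records are placeholders rather than measurements from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_throughput_by_iters(records):
    """Average throughput over all batch sizes per (device, iterations)."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["device"], r["iters"])].append(r["throughput"])
    return {key: mean(vals) for key, vals in groups.items()}


def speedup_by_iters(records):
    """GPU/CPU speedup per iteration count (assumed: ratio of the means)."""
    means = mean_throughput_by_iters(records)
    iters = sorted({it for (_, it) in means})
    return {it: means[("gpu", it)] / means[("cpu", it)] for it in iters}


def approx_cores(cpu_percent):
    """Convert a process CPU percentage to cores (assumed: 100% = one core)."""
    return cpu_percent / 100.0


def active_samples(gpu_util_percent, threshold=5.0):
    """Keep only GPU utilization samples above the activity threshold."""
    return [u for u in gpu_util_percent if u > threshold]


# Placeholder records in arbitrary throughput units, NOT data from the paper.
records = [
    {"device": "cpu", "iters": 10, "batch": 64, "throughput": 100.0},
    {"device": "cpu", "iters": 10, "batch": 128, "throughput": 140.0},
    {"device": "gpu", "iters": 10, "batch": 64, "throughput": 600.0},
    {"device": "gpu", "iters": 10, "batch": 128, "throughput": 840.0},
]

speedups = speedup_by_iters(records)
print(speedups)
print(approx_cores(1000.0))
print(active_samples([0.0, 3.0, 42.0, 97.0]))
```

Grouping by iteration count before taking the ratio mirrors the Figure 1 caption, where each point is a mean over batch sizes; the 5% threshold in `active_samples` mirrors the filter described for the right panel of Figure 2.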
