DecodeX: Exploring and Benchmarking of LDPC Decoding across CPU, GPU, and ASIC Platforms

Zhenzhou Qi, Yuncheng Yao, Yiming Li, Chung-Hsuan Tung, Junyao Zheng, Danyang Zhuo, Tingjun Chen

TL;DR

This work addresses the challenge of efficiently decoding LDPC codes in heterogeneous vRAN environments by introducing DecodeX, a cross-platform benchmarking framework that unifies CPU, GPU, and ASIC LDPC decoding implementations under a common interface. The authors implement and profile four decoding paths (DecodeCPU-FlexRAN, DecodeGPU-Aerial, DecodeASIC-ACC100, and DecodeGPU-SionnaRK) and analyze how threading, memory movement, and offload orchestration affect latency across varying MCS, SNR, and PRB settings. Key findings show that accelerator gains are strongly influenced by data movement and workload granularity: ACC100 and GPU-based decoders deliver substantial latency reductions compared to the CPU, while inline GPU paths minimize transfer overhead and yield the best end-to-end performance. DecodeX provides actionable insights for cross-platform co-design and dynamic resource management in future NextG vRANs, and offers an open-source suite to benchmark, reproduce, and extend across new architectures and configurations.

Abstract

Emerging virtualized radio access networks (vRANs) demand flexible and efficient baseband processing across heterogeneous compute substrates. In this paper, we present DecodeX, a unified benchmarking framework for evaluating low-density parity-check (LDPC) decoding acceleration across different hardware platforms. DecodeX integrates a comprehensive suite of LDPC decoder implementations, including kernels, APIs, and test vectors for CPUs (FlexRAN), GPUs (Aerial and Sionna-RK), and ASICs (ACC100), and can be readily extended to additional architectures and configurations. Using DecodeX, we systematically characterize how different platforms orchestrate computation, from threading and memory management to data movement and accelerator offload, and quantify the resulting decoding latency under varying physical-layer parameters. Our observations reveal distinct trade-offs in parallel efficiency and offload overhead, showing that accelerator gains strongly depend on data movement and workload granularity. Building on these insights, we discuss how cross-platform benchmarking can inform adaptive scheduling and co-design for future heterogeneous vRANs, enabling scalable and energy-efficient baseband processing for NextG wireless systems.
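
As a rough illustration of what benchmarking several decoders behind one interface across physical-layer parameters can look like, the sketch below times any decoder callable over a grid of MCS, SNR, and PRB values. It is not DecodeX code: `L1Config`, `sweep`, and `dummy_decoder` are hypothetical names introduced here, and the dummy workload merely stands in for a real FlexRAN, Aerial, ACC100, or Sionna-RK decode call.

```python
import time
from dataclasses import dataclass
from itertools import product
from statistics import median
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class L1Config:
    """Hypothetical container for one physical-layer operating point."""
    mcs: int        # modulation and coding scheme index
    snr_db: float   # signal-to-noise ratio in dB
    prb: int        # number of physical resource blocks


# Assumed common interface: a decoder is any callable that takes a config.
DecodeFn = Callable[[L1Config], None]


def sweep(
    decoders: Dict[str, DecodeFn],
    mcs_values: Iterable[int],
    snr_values: Iterable[float],
    prb_values: Iterable[int],
    repeats: int = 5,
) -> Dict[Tuple[str, L1Config], float]:
    """Return the median wall-clock latency (ms) per decoder and config."""
    grid = list(product(mcs_values, snr_values, prb_values))
    results: Dict[Tuple[str, L1Config], float] = {}
    for name, decode in decoders.items():
        for mcs, snr_db, prb in grid:
            cfg = L1Config(mcs, snr_db, prb)
            samples = []
            for _ in range(repeats):
                start = time.perf_counter()
                decode(cfg)
                samples.append((time.perf_counter() - start) * 1e3)
            results[(name, cfg)] = median(samples)
    return results


def dummy_decoder(cfg: L1Config) -> None:
    """Placeholder workload that scales with PRB count; not a real decoder."""
    sum(range(cfg.prb * 100))


if __name__ == "__main__":
    table = sweep({"dummy": dummy_decoder}, [0, 10], [5.0, 15.0], [24, 106])
    for (name, cfg), ms in table.items():
        print(f"{name} MCS={cfg.mcs} SNR={cfg.snr_db} dB PRB={cfg.prb}: {ms:.3f} ms")
```

Timing the whole call, as done here, captures host-side orchestration and data transfer together with the decode itself. Isolating kernel time on an accelerator would need platform-specific instrumentation that this sketch does not attempt.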

Paper Structure

This paper contains 10 sections, 5 figures, 2 algorithms.

Figures (5)

  • Figure 1: (a)--(c) Overview of L1 processing models with different hardware accelerators (HWAs): Inline Processing, Lookaside Acceleration, and Hybrid Acceleration. (d) Representative DecodeX implementations of these models: DecodeCPU-FlexRAN (Intel Xeon CPU), DecodeGPU-Aerial (NVIDIA H200, RTX 3090, and RTX 6000 Ada GPUs), DecodeASIC-ACC100 (Intel eASIC), and DecodeGPU-SionnaRK (NVIDIA Jetson Orin AGX).
  • Figure 2: Throughput comparison between sequential and bulk (Algorithm \ref{algo:dp-user-selection}) LDPC enqueue and dequeue pipelines on the Intel and Silicom ACC100 platforms.
  • Figure 3: LDPC decoding latency across heterogeneous platforms under varying L1 configurations. Each heatmap shows per-TB decoding latency (ms) as a function of MCS index and SNR level for four PRB allocations. Platforms: (1st) CPU (FlexRAN, AVX512); (2nd) ACC100 lookaside acceleration; (3rd) H200 decoding via NVIDIA's Aerial; (4th) RTX 3090 decoding via NVIDIA's Aerial; and (5th) RTX 6000 Ada decoding via NVIDIA's Aerial.
  • Figure 4: PyAerial LDPC decoding performance under sequential and parallel processing. (a) GPU kernel execution time and utilization, excluding host-device data movement. (b) Overall LDPC decoding latency, including CPU orchestration and data transfer overheads.
  • Figure 5: LDPC decoding latency on DecodeGPU-SionnaRK across different iteration counts, with varying numbers of information bits ($K$) and code rates (CR).