
TeraPool-SDR: An 1.89TOPS 1024 RV-Cores 4MiB Shared-L1 Cluster for Next-Generation Open-Source Software-Defined Radios

Yichao Zhang, Marco Bertuletti, Samuel Riedel, Matheus Cavalcante, Alessandro Vanelli-Coralli, Luca Benini

TL;DR

The paper tackles the demand for high-throughput, power-efficient SDR processing in next-generation RANs. It introduces TeraPool-SDR, a physically-aware, three-level hierarchical many-core cluster of $1024$ Snitch RV32 cores sharing a $4\,\mathrm{MiB}$ L1 memory with $4096$ banks, implemented in GF 12nm LP+ technology. The authors detail the Tile/SubGroup/Group architecture, the interconnect design, and a full physical implementation, including latency-throughput trade-offs and area/power analyses, with an emphasis on open-source availability. Across FFT, MatMul, Channel Estimation, and Linear System Inversion kernels, the design achieves high energy efficiency (61 to 125 GOPS/W) and practical power consumption (<10 W) while reaching up to $1.89\,\mathrm{TOPS}$ at near-GHz clock frequencies (up to 924 MHz), suggesting a viable path for open, programmable PHY stacks in 5G/6G base stations and potential 3D-IC scaling.

Abstract

Radio Access Network (RAN) workloads are rapidly scaling up in data processing intensity and throughput as the 5G (and beyond) standards grow in number of antennas and sub-carriers. Offering flexible Processing Elements (PEs), efficient memory access, and a productive parallel programming model, many-core clusters are a well-matched architecture for next-generation software-defined RANs, but staggering performance requirements demand a high number of PEs coupled with extreme Power, Performance and Area (PPA) efficiency. We present the architecture, design, and full physical implementation of TeraPool-SDR, a cluster for Software-Defined Radio (SDR) with 1024 latency-tolerant, compact RV32 PEs, sharing a global view of a 4 MiB, 4096-banked L1 memory. We report various feasible configurations of TeraPool-SDR featuring an ultra-high bandwidth PE-to-L1-memory interconnect, clocked at 730 MHz, 880 MHz, and 924 MHz (TT/0.80 V/25 °C) in 12 nm FinFET technology. The TeraPool-SDR cluster achieves high energy efficiency on all key SDR kernels for 5G RANs: Fast Fourier Transform (93 GOPS/W), Matrix Multiplication (125 GOPS/W), Channel Estimation (96 GOPS/W), and Linear System Inversion (61 GOPS/W). For all the kernels, it consumes less than 10 W, in compliance with industry standards.
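
The headline numbers in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope illustration, not the authors' methodology: the core count, L1 size, bank count, and clock frequencies come from the abstract, while the value of two operations per core per cycle (one multiply-accumulate counted as two operations) is an assumption introduced here, as are all function and constant names.

```python
# Back-of-the-envelope figures derived from the numbers quoted in the abstract.
# All names below are hypothetical helpers, not part of any TeraPool-SDR code base.

NUM_CORES = 1024               # from the abstract
L1_BYTES = 4 * 1024 * 1024     # 4 MiB shared L1, from the abstract
NUM_BANKS = 4096               # from the abstract
CLOCKS_MHZ = (730, 880, 924)   # reported configurations (TT/0.80 V/25 C)

# ASSUMPTION: each core retires one multiply-accumulate per cycle,
# counted as two operations. The abstract does not state this.
OPS_PER_CORE_PER_CYCLE = 2


def bank_size_bytes() -> int:
    """Capacity of a single L1 bank if the memory is split evenly."""
    return L1_BYTES // NUM_BANKS


def banks_per_core() -> int:
    """Banking factor: number of L1 banks available per core."""
    return NUM_BANKS // NUM_CORES


def peak_tops(clock_mhz: float, ops_per_cycle: int = OPS_PER_CORE_PER_CYCLE) -> float:
    """Peak cluster throughput in tera-operations per second."""
    return NUM_CORES * clock_mhz * 1e6 * ops_per_cycle / 1e12


if __name__ == "__main__":
    print(f"bank size: {bank_size_bytes()} bytes")
    print(f"banks per core: {banks_per_core()}")
    for mhz in CLOCKS_MHZ:
        print(f"{mhz} MHz -> {peak_tops(mhz):.2f} TOPS peak (assumed 2 ops/cycle/core)")
```

Under that assumption, the 924 MHz configuration works out to about 1.89 TOPS, which matches the figure in the title; the even split also gives 1 KiB per bank and four banks per core.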

Paper Structure

This paper contains 10 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The TeraPool-SDR Tile architecture, with the crossbar interconnection protocol specified in the architecture section.
  • Figure 2: Bottom-up architecture of $\text{TeraPool-SDR}_{\text{1-3-5-7}}$, with the interconnection protocol specified in the architecture section.
  • Figure 3: Throughput and average round-trip latency of TeraPool-SDR's L1 interconnect as a function of the load.
  • Figure 4: Placed-and-routed layout annotated view of each TeraPool-SDR hierarchical instance.
  • Figure 5: Fraction of instructions and stalls over the total cycles for kernel execution in $\text{TeraPool-SDR}_{\text{1-3-5-7}}$.
  • ...and 1 more figure