FlooNoC: A 645 Gbps/link 0.15 pJ/B/hop Open-Source NoC with Wide Physical Links and End-to-End AXI4 Parallel Multi-Stream Support

Tim Fischer, Michael Rogenmoser, Thomas Benz, Frank K. Gürkaynak, Luca Benini

TL;DR

FlooNoC targets the bandwidth-intensive data movement of next-generation AI accelerators by combining very wide, AXI4-compliant links with an end-to-end ordering mechanism and a lightweight, scalable NoC router. The design uses wide physical links, separate physical channels for wide and for short, latency-critical traffic, and a network interface (NI) that can be configured with or without a reorder buffer (RoB). A multi-stream DMA engine eliminates inter-stream dependencies, so high throughput comes at modest area overhead. In a 12nm FinFET implementation, an 8×4 compute mesh with 288 RISC-V cores reaches 645 Gbps per link and 103 Tbps aggregate bandwidth, at an energy efficiency of 0.15 pJ/B/hop and a 3.5% per-tile area overhead. Compared to a traditional AXI4-based multi-layer interconnect, FlooNoC reduces area by 30%; compared to state-of-the-art NoCs, it offers more than 2× the link bandwidth and three times the energy efficiency. The work provides an open-source, end-to-end AXI4-compatible solution that scales with high-bandwidth workloads, enabling efficient bulk data transfer in large-scale accelerator platforms.

Abstract

The new generation of domain-specific AI accelerators is characterized by rapidly increasing demands for bulk data transfers, as opposed to small, latency-critical cache line transfers typical of traditional cache-coherent systems. In this paper, we address this critical need by introducing the FlooNoC Network-on-Chip (NoC), featuring very wide, fully Advanced eXtensible Interface (AXI4) compliant links designed to meet the massive bandwidth needs at high energy efficiency. At the transport level, non-blocking transactions are supported for latency tolerance. Additionally, a novel end-to-end ordering approach for AXI4, enabled by a multi-stream capable Direct Memory Access (DMA) engine, simplifies network interfaces and eliminates inter-stream dependencies. Furthermore, dedicated physical links are instantiated for short, latency-critical messages. A complete end-to-end reference implementation in 12nm FinFET technology demonstrates the physical feasibility and the power, performance, and area (PPA) benefits of our approach. Utilizing wide links on high levels of metal, we achieve a bandwidth of 645 Gbps per link and a total aggregate bandwidth of 103 Tbps for an 8×4 mesh of processor cluster tiles, with a total of 288 RISC-V cores. The NoC imposes a minimal area overhead of only 3.5% per compute tile and achieves a leading-edge energy efficiency of 0.15 pJ/B/hop at 0.8 V. Compared to state-of-the-art NoCs, our system offers three times the energy efficiency and more than double the link bandwidth. Furthermore, compared to a traditional AXI4-based multi-layer interconnect, our NoC achieves a 30% reduction in area, corresponding to a 47% increase in double-precision GFLOPS within the same floorplan.
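To put the headline numbers in perspective, the following back-of-the-envelope sketch derives a few quantities from them. It assumes that the 645 Gbps figure counts the 512 payload bits of a wide link transferred once per clock cycle; that assumption, the function names, and the 1 MiB over 4 hops example are illustrative and not taken from the paper.

```python
# Back-of-the-envelope checks on the figures quoted in the abstract.
LINK_BW_GBPS = 645.0            # per-link bandwidth (from the abstract)
ENERGY_PJ_PER_BYTE_HOP = 0.15   # energy efficiency at 0.8 V (from the abstract)
WIDE_LINK_BITS = 512            # ASSUMPTION: payload bits moved per cycle


def implied_clock_ghz(link_bw_gbps: float, width_bits: int) -> float:
    """Clock (GHz) needed to sustain link_bw_gbps at one beat per cycle."""
    return link_bw_gbps / width_bits


def transfer_time_us(num_bytes: int, link_bw_gbps: float) -> float:
    """Time (microseconds) to stream num_bytes over one fully used link."""
    return num_bytes * 8 / (link_bw_gbps * 1e3)  # bits / (bits per microsecond)


def transfer_energy_uj(num_bytes: int, hops: int, pj_per_byte_hop: float) -> float:
    """NoC energy (microjoules) to move num_bytes across `hops` hops."""
    return num_bytes * hops * pj_per_byte_hop * 1e-6  # pJ -> uJ


if __name__ == "__main__":
    one_mib = 1 << 20  # hypothetical transfer size
    print(f"implied clock : {implied_clock_ghz(LINK_BW_GBPS, WIDE_LINK_BITS):.2f} GHz")
    print(f"1 MiB transfer: {transfer_time_us(one_mib, LINK_BW_GBPS):.2f} us")
    print(f"1 MiB, 4 hops : {transfer_energy_uj(one_mib, 4, ENERGY_PJ_PER_BYTE_HOP):.3f} uJ")
```

Under these assumptions the link would need to run at roughly 1.26 GHz, and moving 1 MiB across four hops would cost well under a microjoule of NoC energy.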

Paper Structure (31 sections, 11 figures, 3 tables)

This paper contains 31 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Technology scaling of on-chip wire resources, based on IRDS reports [IRDS] for 2-14nm and [interco_scaling16] for 22-65nm.
  • Figure 2: Network Interface (NI) architecture for request (AR/AW/W) and response (R/B) paths. Reads and writes are independent in AXI4, and the request and response paths are very similar in the NI, so they are depicted as overlapping modules. The Ordering Unit can be configured with or without reordering capabilities (i.e. with a RoB or RoB-less).
  • Figure 3: Example of a single flit, consisting of header information and an AXI W beat payload of 512 bits.
  • Figure 4: Router microarchitecture of a 5×5 configuration with three links. Routing information such as local coordinates and routing tables is passed in externally.
  • Figure 5: Top: Compute tile consisting of a Snitch cluster with nine RISC-V compute cores, one of which is tightly coupled to a DMA engine capable of handling $C$ streams in parallel, an NI attached to the wide 512-bit and narrow 64-bit bus, and a separate 5×5 router for each physical link in the narrow-wide network. Bottom: Compute mesh of 8×4 compute tiles. The compute mesh connects to the [...] on the left and to the [...]-link for off-chip communication on the bottom.
  • ...and 6 more figures
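
The flit described in the Figure 3 caption (header information plus an AXI W beat payload) can be illustrated with a small packing sketch. The header fields, their widths, and their order below are hypothetical placeholders chosen for illustration, not the FlooNoC flit layout. Only the 512 payload width comes from the caption, and reading it as bits is an assumption.

```python
from dataclasses import dataclass

PAYLOAD_BITS = 512                # payload width from the caption (bits assumed)
COORD_BITS = 4                    # HYPOTHETICAL width of each coordinate field
HEADER_BITS = 4 * COORD_BITS + 1  # four coordinates plus one "last" bit


@dataclass(frozen=True)
class Flit:
    """Toy flit: HYPOTHETICAL header fields followed by a wide payload."""
    dst_x: int
    dst_y: int
    src_x: int
    src_y: int
    last: bool      # marks the final beat of a burst
    payload: int    # AXI W beat data as an unsigned integer

    def pack(self) -> int:
        """Concatenate header and payload into one integer, header on top."""
        coords = (self.dst_x, self.dst_y, self.src_x, self.src_y)
        if any(not 0 <= c < (1 << COORD_BITS) for c in coords):
            raise ValueError("coordinate does not fit in COORD_BITS")
        if not 0 <= self.payload < (1 << PAYLOAD_BITS):
            raise ValueError("payload does not fit in PAYLOAD_BITS")
        header = 0
        for c in coords:
            header = (header << COORD_BITS) | c
        header = (header << 1) | int(self.last)
        return (header << PAYLOAD_BITS) | self.payload


def unpack(word: int) -> Flit:
    """Inverse of Flit.pack()."""
    mask = (1 << COORD_BITS) - 1
    payload = word & ((1 << PAYLOAD_BITS) - 1)
    header = word >> PAYLOAD_BITS
    last = bool(header & 1)
    header >>= 1
    src_y = header & mask
    src_x = (header >> COORD_BITS) & mask
    dst_y = (header >> (2 * COORD_BITS)) & mask
    dst_x = (header >> (3 * COORD_BITS)) & mask
    return Flit(dst_x, dst_y, src_x, src_y, last, payload)
```

Carrying the header alongside the payload on the same wide link, rather than serializing it over several cycles, is what lets one flit move a full data beat per cycle in this toy model.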