FlooNoC: A 645 Gbps/link 0.15 pJ/B/hop Open-Source NoC with Wide Physical Links and End-to-End AXI4 Parallel Multi-Stream Support
Tim Fischer, Michael Rogenmoser, Thomas Benz, Frank K. Gürkaynak, Luca Benini
TL;DR
FlooNoC targets the bandwidth-intensive data movement needs of next-generation AI accelerators by combining very wide, fully AXI4-compliant links with an end-to-end ordering mechanism and a lightweight, scalable NoC router. The design uses wide physical links, dedicated physical channels for short, latency-critical messages, and a network interface (NI) that can operate either without a reorder buffer (RoB) or with one. Coupled with a multi-stream DMA engine, this eliminates inter-stream dependencies and achieves high throughput with modest area overhead. In a physical implementation in 12nm FinFET technology, an 8×4 compute mesh with 288 RISC-V cores reaches 645 Gbps per link and 103 Tbps aggregate bandwidth, with an energy efficiency of 0.15 pJ/B/hop at 0.8 V and an area overhead of 3.5% per compute tile. Compared to a traditional AXI4-based multi-layer interconnect, FlooNoC reduces area by 30%; compared to state-of-the-art NoCs, it offers more than double the link bandwidth and three times the energy efficiency. The work provides an open-source, end-to-end AXI4-compatible solution for efficient bulk data transfer in large-scale accelerator platforms.
Abstract
The new generation of domain-specific AI accelerators is characterized by rapidly increasing demands for bulk data transfers, as opposed to the small, latency-critical cache line transfers typical of traditional cache-coherent systems. In this paper, we address this critical need by introducing the FlooNoC Network-on-Chip (NoC), featuring very wide, fully Advanced eXtensible Interface (AXI4) compliant links designed to meet the massive bandwidth needs at high energy efficiency. At the transport level, non-blocking transactions are supported for latency tolerance. Additionally, a novel end-to-end ordering approach for AXI4, enabled by a multi-stream capable Direct Memory Access (DMA) engine, simplifies network interfaces and eliminates inter-stream dependencies. Furthermore, dedicated physical links are instantiated for short, latency-critical messages. A complete end-to-end reference implementation in 12nm FinFET technology demonstrates the physical feasibility and the power, performance, and area (PPA) benefits of our approach. Using wide links routed on the upper metal layers, we achieve a bandwidth of 645 Gbps per link and a total aggregate bandwidth of 103 Tbps for an 8×4 mesh of processor cluster tiles with a total of 288 RISC-V cores. The NoC imposes a minimal area overhead of only 3.5% per compute tile and achieves a leading-edge energy efficiency of 0.15 pJ/B/hop at 0.8 V. Compared to state-of-the-art NoCs, our system offers three times the energy efficiency and more than double the link bandwidth. Furthermore, compared to a traditional AXI4-based multi-layer interconnect, our NoC achieves a 30% reduction in area, corresponding to a 47% increase in double-precision GFLOPS within the same floorplan.
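To put the headline figures in perspective, the following is a rough back-of-envelope sketch that uses only the numbers quoted in the abstract. The helper names (`transfer_energy_uj`, `link_time_us`), the 1 MiB transfer size, and the hop count are illustrative assumptions and are not taken from the paper. Reading the ratio of aggregate to per-link bandwidth as roughly five links per tile is likewise a guess about how the aggregate was counted.

```python
# Back-of-envelope checks using only figures quoted in the abstract.
LINK_GBPS = 645.0               # per-link bandwidth
AGGREGATE_TBPS = 103.0          # total aggregate bandwidth
MESH_COLS, MESH_ROWS = 8, 4     # 8x4 mesh of compute tiles
TOTAL_CORES = 288
ENERGY_PJ_PER_BYTE_HOP = 0.15   # at 0.8 V

tiles = MESH_COLS * MESH_ROWS                         # 32 tiles
cores_per_tile = TOTAL_CORES // tiles                 # 9 cores per tile
link_equivalents = AGGREGATE_TBPS * 1000 / LINK_GBPS  # about 160
links_per_tile = link_equivalents / tiles             # about 5

def transfer_energy_uj(num_bytes: int, hops: int) -> float:
    """NoC energy in microjoules to move num_bytes over the given number of hops."""
    return num_bytes * hops * ENERGY_PJ_PER_BYTE_HOP * 1e-6  # pJ -> uJ

def link_time_us(num_bytes: int) -> float:
    """Ideal time in microseconds to stream num_bytes over one fully used link."""
    return num_bytes * 8 / (LINK_GBPS * 1e9) * 1e6

if __name__ == "__main__":
    one_mib = 1 << 20  # hypothetical transfer size
    print(f"tiles={tiles}, cores/tile={cores_per_tile}")
    print(f"aggregate / per-link = {link_equivalents:.1f} link-equivalents "
          f"({links_per_tile:.2f} per tile)")
    print(f"1 MiB over 4 hops: {transfer_energy_uj(one_mib, 4):.3f} uJ")
    print(f"1 MiB on one link: {link_time_us(one_mib):.2f} us")
```

Under these assumptions, moving 1 MiB across four hops costs well under a microjoule of NoC energy and occupies a single link for on the order of ten microseconds. These are idealized figures that ignore protocol overhead and contention.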
