METRO: A Software-Hardware Co-Design of Interconnections for Spatial DNN Accelerators
Zhao Wang, Jingchen Zhu, Zhe Zhou, Guangyu Sun
TL;DR
METRO tackles the inter-tile communication bottleneck in tiled spatial DNN accelerators by moving traffic scheduling from hardware to software, which enables global optimization of network-on-chip (NoC) traffic. It pairs a software scheduling framework with a customized NoC. The framework formulates a design space of scheduling strategies and uses an evolutionary search to select effective ones, while an Injection Time Control mechanism and hybrid routing mitigate congestion. Key contributions include a formalization of the traffic design space for tensor applications (covering NoC design, mapping, and routing), header compression via chunk-based packets to reduce overhead, and a detailed evaluation reporting a 56.3% average communication speedup and up to 73.6% reduction in overall processing time compared with traditional on-chip network designs. The work demonstrates a practical path to scalable, energy-efficient spatial accelerators by integrating software-level traffic policies with hardware support.
Abstract
Tiled spatial architectures have proved to be an effective solution for building large-scale DNN accelerators. In these tile-based architectures, the interconnections between tiles are critical for high performance. In this work, we identify the inefficiency of the widely used traditional on-chip networks and the opportunity for software-hardware co-design. We propose METRO, whose basic idea is to decouple the traffic scheduling policies from the hardware fabric and move them to the software level. METRO contains two modules working in synergy: the METRO software scheduling framework, which coordinates the traffic, and the METRO hardware facilities, which deliver the data based on software configurations. We evaluate the co-design on a wide range of DNN models selected from real-world workloads, and in a synthetic study with different flit sizes that illustrates its effectiveness under various hardware resource constraints. The results show that METRO achieves a 56.3% communication speedup on average and up to a 73.6% reduction in overall processing time compared with traditional on-chip network designs.
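To make the idea of software-level traffic scheduling concrete, the following is a minimal illustrative sketch, not METRO's actual algorithm. It assumes a 2D mesh, deterministic X-then-Y routing, one flit per link per cycle, and a simple greedy policy that delays each flow's injection until its path is free of conflicts. The names `xy_route` and `schedule_injections` and the flow format are hypothetical and chosen only for this example.

```python
from itertools import count


def xy_route(src, dst):
    """Return the directed links on an X-then-Y path between two mesh nodes."""
    (x, y), (dst_x, dst_y) = src, dst
    links = []
    while x != dst_x:  # travel along X first
        next_x = x + (1 if dst_x > x else -1)
        links.append(((x, y), (next_x, y)))
        x = next_x
    while y != dst_y:  # then along Y
        next_y = y + (1 if dst_y > y else -1)
        links.append(((x, y), (x, next_y)))
        y = next_y
    return links


def schedule_injections(flows):
    """Greedily choose the earliest injection cycle for each flow.

    flows: list of (src, dst, n_flits) tuples (hypothetical format).
    Assumption: flit k of a flow injected at cycle `start` occupies the
    link at hop h during cycle start + h + k (pipelined, no buffering).
    Returns one injection cycle per flow, in input order.
    """
    busy = set()  # (link, cycle) pairs already reserved
    starts = []
    for src, dst, n_flits in flows:
        path = xy_route(src, dst)
        for start in count(0):
            slots = {
                (link, start + hop + k)
                for hop, link in enumerate(path)
                for k in range(n_flits)
            }
            if busy.isdisjoint(slots):  # no link is shared in any cycle
                busy |= slots
                starts.append(start)
                break
    return starts


if __name__ == "__main__":
    flows = [
        ((0, 0), (2, 0), 2),  # two flits, two hops along X
        ((1, 0), (2, 0), 2),  # shares the last link with the first flow
        ((0, 1), (0, 0), 1),  # uses an unrelated link
    ]
    print(schedule_injections(flows))  # [0, 3, 0]
```

In this toy setting the second flow is held back until cycle 3 so that it never contends with the first flow for the shared link, while the third flow is injected immediately. Because the schedule is computed ahead of time with a global view of all flows, the hardware in such a scheme would only need to follow the precomputed injection times rather than arbitrate conflicts at run time.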
