Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture
Mohammed Elbtity, Peyton Chandarana, Ramtin Zand
TL;DR
This paper addresses the throughput limits of static dataflows in TPUs by introducing Flex-TPU, a hardware design that reconfigures the dataflow at runtime on a per-layer basis. The architecture extends a conventional systolic array by adding one register and two multiplexers to each processing element, so the array can operate in input-stationary, output-stationary, or weight-stationary mode under the control of a Configuration Management Unit, with the dataflow chosen per layer to optimize performance. Using cycle-accurate ScaleSim simulations and targeted synthesis, the authors demonstrate up to a 2.75× speedup over conventional TPUs with modest area and power overheads, and they show that the performance gains persist and grow as the systolic array scales up to $S = 256 \times 256$. The work highlights the practical potential of layer-wise dataflow optimization to boost throughput for diverse DNN workloads in both data-center and edge contexts, potentially shaping next-generation TPU designs.
Abstract
Tensor processing units (TPUs) are among the most well-known machine learning (ML) accelerators, used at large scale in data centers as well as in tiny ML applications. TPUs offer several advantages over conventional ML accelerators such as graphics processing units (GPUs), because they are designed specifically to perform the multiply-accumulate (MAC) operations required by the matrix-matrix and matrix-vector multiplications that dominate the execution of deep neural networks (DNNs). These advantages include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow: input-stationary, output-stationary, or weight-stationary. This restriction can limit the achievable performance of DNN inference and reduce the utilization of compute units. This work therefore develops a reconfigurable-dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer at runtime. Our experiments test the viability of the Flex-TPU by comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a performance increase of up to 2.75× compared to a conventional TPU, with only minor area and power overheads.
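To make the per-layer idea concrete, the following minimal Python sketch compares a single static dataflow against a schedule that picks the fastest dataflow for each layer. It is only an illustration under stated assumptions: the cycle counts are made-up placeholders rather than results from the paper, the function names `best_static` and `per_layer_schedule` are hypothetical, and the cost of switching dataflows between layers is assumed to be zero.

```python
from typing import Dict, List, Tuple

# The three dataflows named in the text: input-, output-, and weight-stationary.
DATAFLOWS = ("IS", "OS", "WS")

# Hypothetical cycle counts per layer and dataflow (placeholders, NOT paper data).
# In practice these would come from a cycle-accurate simulator.
LAYER_CYCLES: List[Dict[str, int]] = [
    {"IS": 1200, "OS": 900, "WS": 1500},
    {"IS": 800, "OS": 1100, "WS": 700},
    {"IS": 600, "OS": 650, "WS": 1000},
]


def best_static(layers: List[Dict[str, int]]) -> Tuple[str, int]:
    """Return the single dataflow with the lowest total cycles over all layers."""
    totals = {df: sum(layer[df] for layer in layers) for df in DATAFLOWS}
    best = min(totals, key=totals.get)
    return best, totals[best]


def per_layer_schedule(layers: List[Dict[str, int]]) -> Tuple[List[str], int]:
    """Pick the fastest dataflow independently for each layer.

    Assumes reconfiguration between layers costs no cycles.
    """
    schedule = [min(DATAFLOWS, key=layer.get) for layer in layers]
    total = sum(layer[df] for layer, df in zip(layers, schedule))
    return schedule, total


if __name__ == "__main__":
    static_df, static_cycles = best_static(LAYER_CYCLES)
    schedule, flex_cycles = per_layer_schedule(LAYER_CYCLES)
    print(f"best static dataflow: {static_df} ({static_cycles} cycles)")
    print(f"per-layer schedule:   {schedule} ({flex_cycles} cycles)")
    print(f"speedup:              {static_cycles / flex_cycles:.2f}x")
```

Because the per-layer schedule takes the minimum for every layer, its total can never exceed that of the best static dataflow under the zero-overhead assumption; the size of the gain depends entirely on how much the preferred dataflow varies across layers.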
