Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture

Mohammed Elbtity, Peyton Chandarana, Ramtin Zand

TL;DR

This paper addresses the throughput limits of static dataflows in TPUs by introducing Flex-TPU, a hardware design that reconfigures the dataflow at runtime on a per-layer basis. The architecture extends a conventional systolic array by adding one register and two multiplexers to each processing element (PE), so that the array can operate in input-stationary (IS), output-stationary (OS), or weight-stationary (WS) mode under the control of a Configuration Management Unit, together with supporting dataflow generation to optimize per-layer performance. Using cycle-accurate ScaleSim simulations and targeted synthesis, the authors demonstrate up to a 2.75× speedup over conventional TPUs with modest area and power overheads, and they show that the performance gains persist and grow as the systolic array scales up to $S = 256 \times 256$. The work highlights the practical potential of layer-wise dataflow optimization to boost throughput for diverse DNN workloads in both data-center and edge contexts, potentially shaping next-generation TPU designs.

Abstract

Tensor processing units (TPUs) are among the most well-known machine learning (ML) accelerators, used at large scale in data centers as well as in tiny ML applications. TPUs offer several advantages over conventional ML accelerators such as graphics processing units (GPUs), because they are designed specifically to perform the multiply-accumulate (MAC) operations required by the matrix-matrix and matrix-vector multiplications that dominate the execution of deep neural networks (DNNs). These advantages include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow: input stationary, output stationary, or weight stationary. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, this work develops a reconfigurable dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer at run-time. Our experiments thoroughly test the viability of the Flex-TPU, comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a significant performance increase of up to 2.75× compared to a conventional TPU, with only minor area and power overheads.
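
The per-layer selection idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`pick_dataflows`, `total_cycles`) and the cycle counts are hypothetical, chosen only for illustration, and the sketch assumes that switching dataflows between layers costs no extra cycles.

```python
from typing import Dict, List

# The three dataflows named in the abstract.
DATAFLOWS = ("IS", "OS", "WS")


def pick_dataflows(cycles_per_layer: List[Dict[str, int]]) -> List[str]:
    """For each layer, return the dataflow with the fewest cycles."""
    return [min(DATAFLOWS, key=lambda d: layer[d]) for layer in cycles_per_layer]


def total_cycles(cycles_per_layer: List[Dict[str, int]], schedule: List[str]) -> int:
    """Sum the cycles of every layer under the given per-layer schedule."""
    return sum(layer[d] for layer, d in zip(cycles_per_layer, schedule))


# Hypothetical cycle counts for a three-layer network (NOT from the paper).
layers = [
    {"IS": 900, "OS": 400, "WS": 700},
    {"IS": 300, "OS": 500, "WS": 450},
    {"IS": 800, "OS": 650, "WS": 200},
]

# Flexible schedule: the best dataflow is chosen independently per layer.
flex_schedule = pick_dataflows(layers)
flex = total_cycles(layers, flex_schedule)

# Static baselines: one dataflow is used for every layer.
static = {d: total_cycles(layers, [d] * len(layers)) for d in DATAFLOWS}
best_static = min(static.values())

# Speedup of the flexible schedule over the best static dataflow,
# assuming reconfiguration itself costs no cycles.
speedup = best_static / flex
print(flex_schedule, flex, static, speedup)
```

With these made-up numbers, each layer prefers a different dataflow, so the flexible schedule beats every static choice. In practice the per-layer cycle counts would come from a simulator or from measurement.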

Paper Structure (7 sections, 7 figures, 2 tables)

This paper contains 7 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Cycles required to execute each layer of the ResNet-18 model using static dataflow architectures: (a) input stationary, (b) output stationary, and (c) weight stationary. The layer-wise comparison shows that the optimal dataflow can differ from layer to layer, emphasizing the potential benefits of a flexible TPU with a run-time reconfigurable dataflow.
  • Figure 2: The proposed Flex-TPU architecture.
  • Figure 3: The proposed Flex-TPU processing element (PE) with runtime reconfigurable dataflow.
  • Figure 4: The three flexible PE dataflow configurations controlled by the two added MUXs: (a) IS, (b) OS, and (c) WS modes.
  • Figure 5: The layout of the in-house designed TPU chip exhibiting the ratio of the systolic array compared to the surrounding logic and controller.
  • ...and 2 more figures