LoopTree: Exploring the Fused-layer Dataflow Accelerator Design Space

Michael Gilbert, Yannan Nellie Wu, Joel S. Emer, Vivienne Sze

TL;DR

The paper tackles the data-transfer bottleneck in DNN accelerators by advancing fused-layer dataflows that reuse inter-layer data in on-chip buffers. It introduces LoopTree, an analytical model that evaluates a broadened design space of tiling, retention, and recomputation choices, and provides a taxonomy for specifying designs. Validation against multiple prior architectures shows a worst-case error of 4%, and case studies reveal substantial buffer-capacity reductions and trade-offs between on-chip memory, off-chip transfers, and recomputation. The work enables systematic exploration of fused-layer designs, offering practical insights for building low-latency, energy-efficient accelerators.

Abstract

Latency and energy consumption are key metrics in the performance of deep neural network (DNN) accelerators. A significant factor contributing to latency and energy is data transfers. One method to reduce data transfers is reusing data when multiple operations use the same data. Fused-layer accelerators reuse data across operations in different layers by retaining intermediate data in on-chip buffers, which has been shown to reduce energy consumption and latency. Moreover, the intermediate data is often tiled (i.e., broken into chunks) to reduce the on-chip buffer capacity required to reuse the data. Because on-chip buffer capacity is frequently more limited than computation units, fused-layer dataflow accelerators may also recompute certain parts of the intermediate data instead of retaining them in a buffer. Achieving efficient trade-offs between on-chip buffer capacity, off-chip transfers, and recomputation requires systematic exploration of the fused-layer dataflow design space. However, prior work only explored a subset of the design space, and more efficient designs are left unexplored. In this work, we propose (1) a more extensive design space that has more choices in terms of tiling, data retention, and recomputation and, importantly, allows us to explore them in combination, (2) a taxonomy to systematically specify designs, and (3) a model, LoopTree, to evaluate the latency, energy consumption, buffer capacity requirements, and off-chip transfers of designs in this design space. We validate our model against a representative set of prior architectures, achieving a worst-case 4% error. Finally, we present case studies that show how exploring this larger space results in more efficient designs (e.g., up to a 10$\times$ buffer capacity reduction to achieve the same off-chip transfers).
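The trade-off between retaining and recomputing intermediate data can be illustrated with a small counting exercise. The sketch below is not LoopTree itself; it assumes a hypothetical pair of fused 1D convolution layers with stride 1 and no padding, and the function name `fused_tile_stats` and its parameters (`p3`, `r2`, `tile`) are made up for illustration. It counts, per channel, how many intermediate fmap values must be buffered and how many are computed when the second layer's output is tiled.

```python
from math import ceil


def fused_tile_stats(p3, r2, tile):
    """Per-channel counts for the intermediate fmap of two fused 1D conv layers.

    Assumptions (illustrative, not taken from the paper): stride 1, no padding,
    the second layer has filter width r2 and output width p3, and its output
    is split into tiles of width `tile` (the last tile may be narrower).
    """
    p2 = p3 + r2 - 1                       # full width of the intermediate fmap
    n_tiles = ceil(p3 / tile)
    out_widths = [min(tile, p3 - i * tile) for i in range(n_tiles)]
    # Each output tile needs an intermediate tile that is r2 - 1 values wider,
    # so consecutive intermediate tiles overlap by r2 - 1 values.
    mid_widths = [w + r2 - 1 for w in out_widths]
    return {
        "layer_by_layer_buffer": p2,               # whole intermediate fmap is kept
        "fused_buffer": max(mid_widths),           # only one intermediate tile is kept
        "computed_if_retained": p2,                # overlap is reused across tiles
        "computed_if_recomputed": sum(mid_widths), # overlap is computed again per tile
    }


stats = fused_tile_stats(p3=8, r2=3, tile=4)
```

Under these assumptions, `p3=8`, `r2=3`, `tile=4` gives a fused buffer of 6 values instead of 10, and recomputing the overlap raises the number of computed intermediate values from 10 to 12. Smaller tiles shrink the buffer further but increase the overlap that must be either retained or recomputed.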

Paper Structure

This paper contains 41 sections, 2 equations, 18 figures, and 10 tables.

Figures (18)

  • Figure 1: Comparison of layer-by-layer and fused-layer dataflows. (a) Two layers (white boxes represent operations within the layers) and three fmaps. (b) Layer-by-layer processing produces all of Fmap2 before it is used. (c) Tiling layer operations and fmaps. (d) A fused-layer processing of Layer1 and Layer2, where only a tile of Fmap2 needs to be retained in a buffer at a time.
  • Figure 2: A 1D conv layer. All input channels ($C$) are used to generate an output fmap. Values (i.e., activations) in the output fmap column ($P$) are generated by sliding the convolution window. Different output channels ($M$) are generated with different filters.
  • Figure 3: Examples of different tiling choices. (a) Tiles (darker shade) of Fmap1, Fmap2, and Fmap3 in iterations 0 and 1 in an output row ($P2$ rank) tiling. Note that Fmap2 tiles overlap using this tiling. (b) Tiles (darker shade) of Fmap1, Fmap2, and Fmap3 in iterations 0 and 1 in an intermediate-channel ($C2$ rank) tiling. Note that Fmap2 tiles do not overlap using this tiling.
  • Figure 4: The width (height is the same as width) and channels of layers in ResNet-18 (layers 1-5) and MobileNetv2 (layers 6-11) vary by orders of magnitude.
  • Figure 5: Partitioning rank $P2$ in Conv2 to create two Conv2 tiles, Tile0 and Tile1. (a) Given the operation Conv2 Tile0, other data and operation tiles can be inferred. (b) The same is true for Tile1.
  • ...and 13 more figures
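
The tile inference described in the captions of Figures 3 and 5 can be sketched for a simplified case. The example below assumes 1D convolutions with stride 1 and no padding, uses half-open index ranges, and introduces the hypothetical helpers `infer_tiles` and `overlap` along with the filter widths `r1` and `r2`; none of these names come from the paper. Given a range of the $P2$ rank for a Conv2 tile, it infers the Fmap3, Fmap2, and Fmap1 tiles and the Conv1 operations that tile depends on.

```python
def infer_tiles(p2_lo, p2_hi, r1, r2):
    """Infer data and operation tiles from a Conv2 tile covering P2 in [p2_lo, p2_hi).

    Assumptions (illustrative only): 1D convolutions, stride 1, no padding,
    Conv1 has filter width r1 and Conv2 has filter width r2.
    """
    fmap3 = (p2_lo, p2_hi)                # outputs produced by this Conv2 tile
    fmap2 = (p2_lo, p2_hi + r2 - 1)       # Conv2 inputs: window extends r2 - 1 further
    conv1_ops = fmap2                     # Conv1 must produce exactly that Fmap2 tile
    fmap1 = (fmap2[0], fmap2[1] + r1 - 1) # Conv1 inputs: window extends r1 - 1 further
    return {"fmap3": fmap3, "fmap2": fmap2, "conv1_ops": conv1_ops, "fmap1": fmap1}


def overlap(a, b):
    """Number of indices shared by two half-open ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


tile0 = infer_tiles(0, 4, r1=3, r2=3)
tile1 = infer_tiles(4, 8, r1=3, r2=3)
fmap2_overlap = overlap(tile0["fmap2"], tile1["fmap2"])
fmap1_overlap = overlap(tile0["fmap1"], tile1["fmap1"])
```

With these inputs, the two Fmap3 tiles are disjoint while the inferred Fmap2 tiles share 2 values and the Fmap1 tiles share 4, which matches the caption's note that Fmap2 tiles overlap under a $P2$ tiling.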