SMOF: Streaming Modern CNNs on FPGAs with Smart Off-Chip Eviction

Petros Toupas, Zhewen Yu, Christos-Savvas Bouganis, Dimitrios Tzovaras

TL;DR

This work tackles the memory bottleneck in streaming FPGA CNN accelerators by introducing activation eviction, weight fragmentation, and subgraph reconfiguration to enable mapping of modern architectures with long skip connections onto memory-constrained devices. It integrates these mechanisms into the fpgaConvNet toolflow and employs a greedy, iterative design-space exploration to optimize on-chip/off-chip memory usage and partitioning. Across 2D and 3D vision tasks, SMOF demonstrates competitive and, in several cases, state-of-the-art throughput, achieving up to $10.65\times$ improvements over previous approaches. By leveraging off-chip memory as a buffering layer and enabling flexible subgraph reconfiguration, the method broadens FPGA applicability to complex CNNs with limited on-chip resources.

Abstract

Convolutional Neural Networks (CNNs) have demonstrated their effectiveness in numerous vision tasks. However, their high processing requirements necessitate efficient hardware acceleration to meet an application's performance targets. On FPGAs, streaming-based dataflow architectures are often adopted, as they achieve significant performance gains through layer-wise pipelining and reduce off-chip memory accesses by retaining data on-chip. However, modern topologies, such as the UNet, YOLO, and X3D models, utilise long skip connections, which require significant on-chip storage and thus limit the performance achieved by such system architectures. This paper addresses that limitation by introducing mechanisms that evict weights and activations to off-chip memory along the computational pipeline, taking into account the available compute and memory resources. The proposed mechanism is incorporated into an existing toolflow, expanding the design space by utilising off-chip memory as a buffer. This enables the mapping of such modern CNNs to devices with limited on-chip memory under the streaming architecture design approach. SMOF delivers competitive and, in some cases, state-of-the-art performance across a spectrum of computer vision tasks, achieving up to a $10.65\times$ throughput improvement compared to previous works.

Paper Structure

This paper contains 18 sections, 11 equations, 8 figures, 5 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Activation eviction. Instead of being held in an on-chip buffer, the activation data of a long skip connection is pushed to off-chip memory. Lossless encoding schemes, such as RLE and Huffman coding, are supported to reduce the off-chip bandwidth.
  • Figure 2: Weight fragmentation. The original weights, with a depth of $d$, are fragmented into static and dynamic regions. Weights in the dynamic regions are loaded from off-chip memory at runtime and share the same physical memory space in a time-multiplexed manner. The ratio of dynamic regions is denoted as $m$.
  • Figure 3: Graph manipulation strategies. One subfigure showcases the strategy of partitioning the graph into multiple subgraphs, separated by reconfiguration points. The other presents the off-chip streaming strategy, which utilises the activation eviction method: by streaming the long skip connections to off-chip memory and reading them back at the merge points, the graph can fit onto the device without multiple partitions and hence without device reconfiguration.
  • Figure 4: A streaming accelerator design example with the proposed memory optimisation methodology, where the weights of the layer Conv_22 are partially offloaded to off-chip memory, and the long skip connection between Relu_3 and Concat_47 is also evicted to off-chip memory.
  • Figure 5: A segment of the design's waveform illustrates the distinction between the initiation rate $r^{st}$, which is applicable to the pipeline depth region of the layers, and the standard input rate $r^{in}$, evident for the remainder of the layer's execution.
  • ...and 3 more figures
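
The mechanisms described in the captions of Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not the SMOF or fpgaConvNet implementation: the function names `rle_encode`, `rle_decode`, and `split_weight_depth` are hypothetical, the RLE format (value, run-length pairs) is one common variant chosen for illustration, and rounding the dynamic region up to a whole number of words is an assumption. The sketch only shows why run-length encoding can shrink sparse activation traffic and how a dynamic ratio $m$ divides a weight memory of depth $d$.

```python
from math import ceil
from typing import List, Tuple


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Run-length encode a sequence into (value, run_length) pairs (lossless)."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            # Extend the current run.
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            # Start a new run.
            runs.append((v, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Invert rle_encode, recovering the original sequence exactly."""
    out: List[int] = []
    for value, length in runs:
        out.extend([value] * length)
    return out


def split_weight_depth(d: int, m: float) -> Tuple[int, int]:
    """Split a weight memory of depth d into (static, dynamic) depths.

    m is the ratio of the dynamic region, as in the Figure 2 caption.
    Assumption: the dynamic depth is rounded up to a whole word.
    """
    if d < 0 or not 0.0 <= m <= 1.0:
        raise ValueError("expected d >= 0 and 0 <= m <= 1")
    dynamic = ceil(m * d)
    return d - dynamic, dynamic


if __name__ == "__main__":
    # Post-ReLU activations tend to contain runs of zeros.
    activations = [0, 0, 0, 5, 5, 0]
    encoded = rle_encode(activations)
    assert rle_decode(encoded) == activations
    print(encoded)
    print(split_weight_depth(1024, 0.25))
```

With $d = 1024$ and $m = 0.25$, the split leaves 768 words in the static region and 256 words in the dynamic region that would be fetched from off-chip memory at runtime. How the dynamic region is further divided for time-multiplexing is not specified in the caption and is not modelled here.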