Revet: A Language and Compiler for Dataflow Threads

Alexander Rucker, Shiv Sundram, Coleman Smith, Matthew Vilim, Raghu Prabhakar, Fredrik Kjolstad, Kunle Olukotun

TL;DR

Revet tackles the challenge of running threaded, control-flow-rich applications on vectorized reconfigurable dataflow accelerators (vRDAs) by introducing a full stack: a programming language with explicit threaded parallelism, an MLIR-based compiler, and a streaming dataflow execution model. Central to Revet is the Structured-Link Tensor Format (SLTF) and a set of streaming primitives that encode control decisions as data, enabling hierarchical barriers and correct composition of nested loops. The work provides a concrete vRDA abstract machine, a compiler pipeline with front-end lowering, memory- and control-flow optimizations, and a CFG-to-dataflow lowering stage, validated on a variety of data-analytic and traversal workloads. Results show Revet consistently outperforms a state-of-the-art GPU (V100) by a geomean of $3.8\times$ on a $4.3\times$ smaller vRDA, with area-adjusted speedups around $16\times$, demonstrating the practicality and potential of thread-based programming for dataflow accelerators.
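As a quick sanity check on the area-adjusted figure, and assuming (the summary does not state this) that it is simply the raw geomean speedup multiplied by the area ratio, the arithmetic works out as:

$$
\underbrace{3.8}_{\text{geomean speedup}} \times \underbrace{4.3}_{\text{area ratio}} = 16.34 \approx 16
$$

This agrees with the "around $16\times$" quoted above. The paper's exact normalization may differ.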

Abstract

Spatial dataflow architectures such as reconfigurable dataflow accelerators (RDA) can provide much higher performance and efficiency than CPUs and GPUs. In particular, vectorized reconfigurable dataflow accelerators (vRDA) in recent literature represent a design point that enhances the efficiency of dataflow architectures with vectorization. Today, vRDAs can be exploited using either hardcoded kernels or MapReduce languages like Spatial, which cannot vectorize data-dependent control flow. In contrast, CPUs and GPUs can be programmed using general-purpose threaded abstractions. The ideal combination would be the generality of a threaded programming model coupled with the efficient execution model of a vRDA. We introduce Revet: a programming model, compiler, and execution model that lets threaded applications run efficiently on vRDAs. The Revet programming language uses threads to support a broader range of applications than Spatial's parallel patterns, and our MLIR-based compiler lowers this language to a generic dataflow backend that operates on streaming tensors. Finally, we show that mapping threads to dataflow outperforms GPUs, the current state-of-the-art for threaded accelerators, by 3.8x.

Paper Structure (55 sections, 14 figures, 5 tables)

This paper contains 55 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: A diagram showing Revet's layout and vector/pipeline parallelism across functional units (FUs) within a compute unit [vilim2021aurochs]. For simplicity, only eight lanes are shown.
  • Figure 2: A loop: a 1-D tensor of threads is expanded into two dimensions and then contracted. Hierarchical-tensor and streaming-barrier (SLTF) views of data are shown. For simplicity, element-wise operations are elided. They could be added along any dataflow edge between complex primitives.
  • Figure 3: In a filter-merge operation (an if statement), a vector of threads is partitioned into two vectors, one for each branch. Here, link B is mapped as scalar to avoid overprovisioning network resources for a rare execution case. If links B and C were equally common, both could be mapped to vector dataflow resources at the cost of additional network congestion.
  • Figure 4: The operation of a forward-backward merge unit (loop) showing how threads iterate repeatedly. This figure shows a scalar entry, under the assumption that each dataflow thread entering on link A will traverse links B and C multiple times.
  • Figure 5: Above, Spatial [koeplinger2018spatial] requires that memory is explicitly transferred before the start of a parallel section. Below, Revet uses control flow to coordinate transfers without stalls.
  • ...and 9 more figures
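
The filter-merge idea in the Figure 3 caption (a vector of threads is partitioned into one vector per branch, then recombined) can be illustrated in software. The sketch below is only a minimal model, not Revet's language or compiler API: the names `filter_threads`, `merge_threads`, and `filter_merge` are hypothetical, threads are modeled as plain values, and the merge simply concatenates the two branches, which is an assumption about ordering rather than something the caption specifies.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def filter_threads(threads: List[T], pred: Callable[[T], bool]) -> Tuple[List[T], List[T]]:
    """Partition a vector of threads into (taken, not_taken) by a per-thread predicate."""
    taken: List[T] = []
    not_taken: List[T] = []
    for t in threads:
        (taken if pred(t) else not_taken).append(t)
    return taken, not_taken


def merge_threads(a: List[T], b: List[T]) -> List[T]:
    """Recombine both branches into one vector (concatenation order is an assumption)."""
    return a + b


def filter_merge(
    threads: List[T],
    pred: Callable[[T], bool],
    then_fn: Callable[[T], T],
    else_fn: Callable[[T], T],
) -> List[T]:
    """Model an if statement over a vector of threads: filter, run each branch, merge."""
    taken, not_taken = filter_threads(threads, pred)
    return merge_threads([then_fn(t) for t in taken], [else_fn(t) for t in not_taken])


if __name__ == "__main__":
    # Even-valued threads take one branch, odd-valued threads take the other.
    print(filter_merge([1, 2, 3, 4], lambda x: x % 2 == 0, lambda x: x * 10, lambda x: -x))
```

In this model each branch sees a dense vector containing only the threads that chose it. This is the property the caption relies on when it discusses mapping a rarely taken link as scalar rather than vector.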