FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow

Rubens Lacouture, Nathan Zhang, Ritvik Sharma, Marco Siracusa, Fredrik Kjolstad, Kunle Olukotun, Olivia Hsu

TL;DR

FuseFlow is a compiler that translates sparse PyTorch models into fused sparse dataflow graphs for reconfigurable dataflow architectures, enabling cross-expression kernel fusion (EKF) and partial fusion via a novel fusion-table IR and SAMML lowering. Implemented in MLIR and validated with cycle-accurate simulation, FuseFlow provides a scheduling interface for fusion granularity and dataflow ordering, plus a fast heuristic to prune suboptimal configurations. Across four model classes, it demonstrates that optimal fusion granularity is model-dependent: it achieves a ~2.7× speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention, and partial fusion yields substantial gains for other models. These results underscore the importance of managing fusion scope in sparse ML on dataflow hardware and offer a practical path from PyTorch models to hardware-executable sparse dataflow graphs.

Abstract

As deep learning models scale, sparse computation and specialized dataflow hardware have emerged as powerful solutions to address efficiency. We propose FuseFlow, a compiler that converts sparse machine learning models written in PyTorch to fused sparse dataflow graphs for reconfigurable dataflow architectures (RDAs). FuseFlow is the first compiler to support general cross-expression fusion of sparse operations. In addition to fusion across kernels (expressions), FuseFlow also supports optimizations like parallelization, dataflow ordering, and sparsity blocking. It targets a cycle-accurate dataflow simulator for microarchitectural analysis of fusion strategies. We use FuseFlow for design-space exploration across four real-world machine learning applications with sparsity, showing that full fusion (entire cross-expression fusion across all computation in an end-to-end model) is not always optimal for sparse models: fusion granularity depends on the model itself. FuseFlow also provides a heuristic to identify and prune suboptimal configurations. Using FuseFlow, we achieve performance improvements, including a ~2.7× speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention.

Paper Structure

This paper contains 28 sections, 20 figures, 4 tables.

Figures (20)

  • Figure 1: Log plot of SM and DRAM utilization (%) for PyG GCN inference on an RTX 5090 across five datasets.
  • Figure 2: SAM graph for sparse-matrix vector multiplication with $j \rightarrow i$ dataflow. Streams: solid grey = coordinate (crd), dashed grey = reference (ref), double black = value (val).
  • Figure 3: Dataflow diagrams for the forms of fusion, showing how they differ and are related.
  • Figure 4: Comparing fusion coverage and performance.
  • Figure 5: Two iteration patterns for $\forall_{ikjl} C_{il} \mathrel{+}= A_{ik}B_{kj}C_{jl}$, represented as loop nests with higher-order reduction variables highlighted in blue [kjolstad2019workspaces]. FuseFlow lowers to a dataflow input iteration graph with a factored iteration space (b), whereas prior work produces dataflow graphs with fully fused iteration spaces (a).
  • ...and 15 more figures