FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow

Rubens Lacouture; Nathan Zhang; Ritvik Sharma; Marco Siracusa; Fredrik Kjolstad; Kunle Olukotun; Olivia Hsu

TL;DR

FuseFlow is a compiler that translates sparse PyTorch models into fused sparse dataflow graphs for reconfigurable dataflow architectures. It enables cross-expression kernel fusion (EKF) and partial fusion through a novel fusion-table IR and SAMML lowering. Implemented in MLIR and validated with cycle-accurate simulation, FuseFlow provides a scheduling interface for fusion granularity and dataflow ordering, plus a fast heuristic that prunes suboptimal configurations. Across four model classes, it shows that the optimal fusion granularity is model-dependent: it achieves a ~2.7× speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention, and partial fusion yields substantial gains for other models. These results underscore the importance of managing fusion scope in sparse ML on dataflow hardware and offer a practical path from PyTorch models to hardware-executable sparse dataflow graphs.

Abstract

As deep learning models scale, sparse computation and specialized dataflow hardware have emerged as powerful ways to improve efficiency. We propose FuseFlow, a compiler that converts sparse machine learning models written in PyTorch to fused sparse dataflow graphs for reconfigurable dataflow architectures (RDAs). FuseFlow is the first compiler to support general cross-expression fusion of sparse operations. In addition to fusion across kernels (expressions), FuseFlow supports optimizations such as parallelization, dataflow ordering, and sparsity blocking. It targets a cycle-accurate dataflow simulator for microarchitectural analysis of fusion strategies. We use FuseFlow for design-space exploration across four real-world machine learning applications with sparsity, showing that full fusion (cross-expression fusion across all computation in an end-to-end model) is not always optimal for sparse models: fusion granularity depends on the model itself. FuseFlow also provides a heuristic to identify and prune suboptimal configurations. Using FuseFlow, we achieve performance improvements, including a ~2.7× speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention.
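To make the notion of fusion granularity concrete, the following is a minimal sketch of the design space for a linear chain of kernels: each boundary between adjacent kernels is either fused or cut, so a chain of n kernels has 2^(n-1) candidate schedules, ranging from fully fused to fully unfused. The kernel names, the `fusion_partitions` and `prune` helpers, and the group-size pruning rule are all hypothetical and chosen only for illustration. They are not FuseFlow's scheduling interface or its actual heuristic, which the abstract does not specify.

```python
from itertools import product


def fusion_partitions(kernels):
    """Yield every split of a kernel chain into contiguous fused groups."""
    n = len(kernels)
    if n == 0:
        return
    # Each of the n-1 boundaries between adjacent kernels is either
    # cut (True, start a new group) or fused (False, extend the group).
    for cuts in product([False, True], repeat=n - 1):
        groups, current = [], [kernels[0]]
        for kernel, cut in zip(kernels[1:], cuts):
            if cut:
                groups.append(tuple(current))
                current = [kernel]
            else:
                current.append(kernel)
        groups.append(tuple(current))
        yield tuple(groups)


def prune(partitions, max_group_size):
    """Hypothetical stand-in heuristic (an assumption, not FuseFlow's):
    drop schedules containing a fused group larger than max_group_size."""
    return [p for p in partitions if max(len(g) for g in p) <= max_group_size]


# Hypothetical four-kernel attention chain.
kernels = ["QK^T", "mask", "softmax", "AV"]
all_schedules = list(fusion_partitions(kernels))
kept = prune(all_schedules, max_group_size=2)

for schedule in kept:
    print(" | ".join("+".join(group) for group in schedule))
```

In this toy setting, the first enumerated schedule is full fusion and the last is the unfused baseline; everything in between is a form of partial fusion. A real exploration would score each candidate with a simulator or cost model instead of a fixed group-size cap.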
