Cyclotron: Compilation of Recurrences to Distributed and Systolic Architectures

Shiv Sundram, Akhilesh Balasingam, Nathan Zhang, Kunle Olukotun, Fredrik Kjolstad

TL;DR

Cyclotron tackles the data-movement bottleneck in modern hardware by expressing computations as recurrences over indexed tensors and lowering them to a space-time IR that captures both computation and communication. It provides a scheduling language to define dataflow (streams, broadcasts, prefetches) and two lowering passes that translate high-level recurrences into per-PE programs executable on a chiplet-style dataflow simulator (DAM) or an MPI-based multinode cluster. The approach unifies patterns such as Cannon, SUMMA, PUMMA, TRSM, Cholesky, and FlashAttention within a single compiler, enabling portable, architecture-aware optimizations and hardware co-design exploration. Results show performance competitive with ScaLAPACK for distributed matrix multiplication and triangular solves, and demonstrate hardware-aware scheduling across simulated and real clusters. This work suggests a path toward a unified recurrence-based programming model for spatial computing, spanning on-chip dataflow to distributed-memory systems.

Abstract

We present Cyclotron, a framework and compiler for using recurrence equations to express streaming dataflow algorithms, which are then portably compiled to distributed topologies of interlinked processors. Our framework provides an input language of recurrences over logical tensors, which is lowered into an intermediate language of recurrences over logical iteration spaces, and finally into programs of send, receive, and computation operations specific to each individual processor. In Cyclotron's IR, programs are optimized such that external memory interactions are confined to the boundaries of the iteration space. Within inner iteration spaces, all data accesses become local: they target values residing in local fast memory or on neighboring processing units, avoiding costly memory movement. We provide a scheduling language that lets users define how data is streamed and broadcast between processors, enabling pipelined execution of computation kernels over distributed topologies of processing elements. We demonstrate the portability of our approach by compiling our IR to a reconfigurable simulator of systolic arrays and chiplet-style distributed hardware, as well as to distributed-memory CPU clusters. In the simulated reconfigurable setting, we use our compiler for hardware design space exploration in which link costs and latencies can be specified. In the distributed CPU setting, we show how to use recurrences and our scheduling language to express various matrix multiplication routines (Cannon, SUMMA, PUMMA, weight stationary) and solvers (triangular solve and Cholesky). For matrix multiplication and the triangular solve, we generate distributed implementations competitive with ScaLAPACK.
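To make the idea of confining memory traffic to iteration-space boundaries concrete, here is a minimal Python sketch of an output-stationary GEMM written as a recurrence over points $(i,j,k)$. It is an illustration under stated assumptions, not Cyclotron's IR or generated code: the function name `gemm_recurrence`, the dictionary-based storage, and the load counter are all hypothetical, and the sketch ignores timing skew and real inter-processor communication. Values of `A` are read from memory only where `j == 0` and values of `B` only where `i == 0`; every other point reads from its neighbor.

```python
def gemm_recurrence(A, B):
    """Compute C = A @ B as a recurrence over the iteration space (i, j, k).

    Hypothetical sketch: memory is touched only at the j == 0 and i == 0
    boundaries; interior points read from a neighboring iteration point.
    """
    n, p, m = len(A), len(B), len(B[0])  # A is n x p, B is p x m
    a, b, c = {}, {}, {}  # per-point values of the three recurrences
    loads = 0  # number of reads from the "external" matrices A and B

    for k in range(p):
        for i in range(n):
            for j in range(m):
                # A streams over j: load at the j == 0 boundary, else forward.
                if j == 0:
                    a[i, j, k] = A[i][k]
                    loads += 1
                else:
                    a[i, j, k] = a[i, j - 1, k]
                # B streams over i: load at the i == 0 boundary, else forward.
                if i == 0:
                    b[i, j, k] = B[k][j]
                    loads += 1
                else:
                    b[i, j, k] = b[i - 1, j, k]
                # C is stationary at (i, j) and accumulates over k.
                prev = c[i, j, k - 1] if k > 0 else 0
                c[i, j, k] = prev + a[i, j, k] * b[i, j, k]

    C = [[c[i, j, p - 1] for j in range(m)] for i in range(n)]
    return C, loads
```

In this sketch the load count is `n*p + p*m`, compared with `2*n*m*p` operand reads if every point fetched both operands from memory, which is the kind of saving the boundary-only formulation is meant to expose.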

Paper Structure

This paper contains 24 sections, 40 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Left: Tensor algebra for matrix multiplication, which contains no dependencies between outputs. Right: Recurrence relation for forward triangular solve, which contains output dependencies: $X_{ri}$ depends on $X_{rj}$ for $j<i$.
  • Figure 2: Representative recurrences
  • Figure 3: Recurrence Grammar
  • Figure 4: Left: Cyclotron's IR for a Cannon-style GEMM, with communication and compute steps annotated. Each line specifies a logical dataflow recurrence over a multi-dimensional iteration space. In this IR, the loop index tuple $(i,j,k)$ is the primary entity. The tensor name appears as a subscript, indicating which array or value is being updated, and the iteration space name appears as a superscript, indicating which recurrence or computation the update belongs to. Memory loads (e.g., $A_{ik}$) are specified in tensor index notation and occur at boundaries of the iteration space (e.g., $(i,0,k)$). Right: Visualization of an internal point of the iteration space.
  • Figure 5: Cyclotron Overview. The system diagram shows the compilation flow for an output-stationary CANNON-style dataflow. Because $i$ and $j$ are mapped across space, $C_{ij}$ is stationary: it stays resident on its PE and does not move during the computation. $k$ is mapped implicitly across time, such that $A_{ik}$ streams systolically over $j$ and $B_{kj}$ is streamed over $i$. The IR can be executed on a dataflow simulator of a chiplet-style architecture or on an MPI-enabled cluster of processors.
  • ...and 11 more figures
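
The output dependency noted in the Figure 1 caption, where $X_{ri}$ depends on $X_{rj}$ for $j<i$, can be illustrated with a short sequential sketch. The exact form of the paper's recurrence is not given in this section, so the code below assumes the system $X L^{T} = B$ with $L$ lower triangular, which yields $X_{ri} = (B_{ri} - \sum_{j<i} X_{rj} L_{ij}) / L_{ii}$. The function name `forward_solve_rows` is hypothetical, and this is plain Python rather than anything Cyclotron generates.

```python
def forward_solve_rows(L, B):
    """Solve X @ L^T = B for X, where L is lower triangular (assumed form).

    Each X[r][i] depends on the already-computed X[r][j] with j < i, which
    is the output dependency that distinguishes a solve from a plain GEMM.
    """
    n = len(L)
    X = [[0.0] * n for _ in B]
    for r, row in enumerate(B):  # rows of X are independent of each other
        for i in range(n):       # columns must be visited in increasing i
            acc = row[i]
            for j in range(i):   # subtract contributions of earlier outputs
                acc -= X[r][j] * L[i][j]
            X[r][i] = acc / L[i][i]
    return X
```

The loop over `r` carries no dependency and could be mapped across space, while the loop over `i` is ordered by the recurrence; this split is one reason such solves are harder to schedule than matrix multiplication.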