Cyclotron: Compilation of Recurrences to Distributed and Systolic Architectures
Shiv Sundram, Akhilesh Balasingam, Nathan Zhang, Kunle Olukotun, Fredrik Kjolstad
TL;DR
Cyclotron tackles the data-movement bottleneck in modern hardware by expressing computations as recurrences over indexed tensors and lowering them to a space-time IR that captures both computation and communication. It provides a scheduling language for defining dataflow (streams, broadcasts, prefetches) and two lowering passes that translate high-level recurrences into per-PE programs, which run either on a chiplet-style dataflow simulator (DAM) or on an MPI-based multinode cluster. The approach expresses Cannon, SUMMA, and PUMMA matrix multiplication, as well as TRSM, Cholesky, and FlashAttention, within a single compiler, enabling portable, architecture-aware optimizations and hardware co-design exploration. The generated distributed matrix multiplication and triangular solve implementations are competitive with ScaLAPACK, and the results demonstrate hardware-aware scheduling across simulated and real clusters. This work suggests a path toward a unified recurrence-based programming model for spatial computing, spanning on-chip dataflow to distributed-memory systems.
Abstract
We present Cyclotron, a framework and compiler for using recurrence equations to express streaming dataflow algorithms, which are then portably compiled to distributed topologies of interlinked processors. Our framework provides an input language of recurrences over logical tensors, which is lowered into an intermediate language of recurrences over logical iteration spaces, and finally into programs of send, receive, and computation operations specific to each individual processor. In Cyclotron's IR, programs are optimized such that external memory interactions are confined to the boundaries of the iteration space. Within inner iteration spaces, all data accesses become local: they target values residing in local fast memory or on neighboring processing units, avoiding costly memory movement. We provide a scheduling language that lets users define how data is streamed and broadcast between processors, enabling pipelined execution of computation kernels over distributed topologies of processing elements. We demonstrate the portability of our approach by compiling our IR to a reconfigurable simulator of systolic arrays and chiplet-style distributed hardware, as well as to distributed-memory CPU clusters. In the simulated reconfigurable setting, we use our compiler for hardware design space exploration in which link costs and latencies can be specified. In the distributed CPU setting, we show how to use recurrences and our scheduling language to express various matrix multiplication routines (Cannon, SUMMA, PUMMA, weight stationary) and solvers (triangular solve and Cholesky). For matrix multiplication and the triangular solve, we generate distributed implementations competitive with ScaLAPACK.
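To make the idea of per-processor send, receive, and compute steps concrete, here is a minimal hand-written sketch of Cannon's algorithm, one of the routines named above. It is not Cyclotron's input language, IR, or generated code. It assumes an n x n torus of processing elements that each hold one scalar, and it simulates the grid sequentially in a single process, so the list rotations only stand in for neighbor-to-neighbor sends and receives. The function name `cannon_matmul` is hypothetical.

```python
def cannon_matmul(A, B):
    """Simulate Cannon's algorithm on an n x n torus, one scalar per PE.

    Illustrative sketch only: a, b, c are the per-PE local values, and the
    rotations below model each PE exchanging data with its torus neighbors.
    """
    n = len(A)
    # Initial skew (the only step that reads the full input matrices):
    # PE (i, j) starts with A[i][(i + j) % n] and B[(i + j) % n][j].
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]

    for _ in range(n):
        # Compute: every PE does a local multiply-accumulate.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Communicate: each PE passes its A value to the left neighbor
        # (receiving from the right) and its B value to the upper neighbor
        # (receiving from below), with wraparound.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


if __name__ == "__main__":
    print(cannon_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # prints [[19, 22], [43, 50]]
```

After the initial skew, each step touches only a PE's own values and those arriving from adjacent PEs, which is the locality property the abstract describes for inner iteration spaces.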
