CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms
Asif Ali Khan, Hamid Farzaneh, Karl F. A. Friebel, Clément Fournier, Lorenzo Chelini, Jeronimo Castrillon
TL;DR
CINM tackles the programmability challenge of heterogeneous compute-in-memory and compute-near-memory systems by delivering an end-to-end MLIR-based compilation flow that abstracts CIM and CNM devices through dedicated dialects and progressive lowering. It introduces the cinm dialect for target selection and device-agnostic IR, plus device-specific cim and cnm dialects and several device dialects to enable hardware-aware optimizations, tiling, and data-massage strategies. Empirical results on UPMEM CNM and memristor-based CIM show CINM can match or surpass hand-optimized implementations, with notable improvements in speed and energy for many kernels and substantial reductions in developer effort. The framework lays groundwork for scalable, adaptable compilation of future CIM/CNM targets, enabling more widespread adoption and continued evolution of heterogeneous memory-centric architectures.
Abstract
The rise of data-intensive applications has exposed the limitations of conventional processor-centric von-Neumann architectures, which struggle to meet the off-chip memory bandwidth demand. Therefore, recent innovations in computer architecture advocate compute-in-memory (CIM) and compute-near-memory (CNM), non-von-Neumann paradigms achieving orders-of-magnitude improvements in performance and energy consumption. Despite significant technological breakthroughs in the last few years, the programmability of these systems is still a serious challenge. Their programming models are too low-level and specific to particular system implementations. Since such future architectures are predicted to be highly heterogeneous, developing novel compiler abstractions and frameworks becomes necessary. To this end, we present CINM (Cinnamon), a first end-to-end compilation flow that leverages hierarchical abstractions to generalize over different CIM and CNM devices and enable device-agnostic and device-aware optimizations. Cinnamon progressively lowers input programs and performs optimizations at each level in the lowering pipeline. To show its efficacy, we evaluate CINM on a set of benchmarks for the well-known UPMEM CNM system and memristor-based CIM accelerators. We show that Cinnamon, supporting multiple hardware targets, generates high-performance code comparable to or better than state-of-the-art implementations.
