CIS: Composable Instruction Set for Data Streaming Applications
Yu Yang, Jordi Altayó González, Paul Delestrac, Ahmed Hemani
TL;DR
The paper addresses the inefficiency of conventional computation-centric ISAs for data-streaming workloads by introducing the Composable Instruction Set (CIS), which uses spatial and temporal composability to map static loops onto distributed hardware resources. It formalizes a hardware template consisting of a sequencer and resource slots, and defines two temporal operators, REPETITION and TRANSITION, that allow nested loop structures to be composed from simple resource-centric instructions. A toy example, in which a 64-element vector operation is implemented with four instructions, together with a discussion of the architecture and compiler implications, illustrates how CIS lets cooperative micro-threads accelerate data streaming. Experimental results on a DRRA platform show that CIS achieves substantially higher effective PE utilization than traditional micro-architectures and parallel designs, approaching the theoretical maximum for non-trivial workloads. These results highlight the potential of CIS for efficient, extensible data-streaming accelerators on heterogeneous hardware.
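To make the idea of composing loops from temporal operators more concrete, the following is a minimal Python sketch. It is not the paper's instruction encoding or semantics: the event model (a list of `(cycle, address)` pairs), the function names `access`, `duration`, `repetition`, and `transition`, and all their parameters are assumptions introduced purely for illustration. It only shows how a repetition-style operator applied twice can yield a two-level loop that streams 64 consecutive addresses, and how a transition-style operator can sequence one event after another.

```python
from typing import List, Optional, Tuple

# Hypothetical model: an event is a list of (cycle, address) pairs.
Event = List[Tuple[int, int]]


def access(address: int = 0) -> Event:
    """A single memory access at cycle 0 (illustrative primitive)."""
    return [(0, address)]


def duration(event: Event) -> int:
    """Number of cycles the event occupies, counted from cycle 0."""
    return max(cycle for cycle, _ in event) + 1


def repetition(event: Event, count: int, step: int,
               period: Optional[int] = None) -> Event:
    """Repeat `event` `count` times; iteration k is shifted by k*period
    cycles and k*step addresses. By default iterations run back to back."""
    if period is None:
        period = duration(event)
    return [(cycle + k * period, addr + k * step)
            for k in range(count)
            for cycle, addr in event]


def transition(first: Event, second: Event, delay: int = 0) -> Event:
    """Run `second` once `first` has finished, after `delay` idle cycles."""
    offset = duration(first) + delay
    return first + [(cycle + offset, addr) for cycle, addr in second]


# Inner loop: 4 consecutive addresses. Outer loop: 16 blocks of 4.
inner = repetition(access(0), count=4, step=1)
stream = repetition(inner, count=16, step=4)
addresses = [addr for _, addr in stream]

# Sequence a follow-up access two idle cycles after the stream ends.
two_phase = transition(stream, access(0), delay=2)
```

In this sketch, nesting two `repetition` calls plays the role of a two-level loop without any explicit loop-control instructions in the composed description; the real CIS operators may differ in their parameters and timing rules.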
Abstract
The enhanced efficiency of hardware accelerators, including Single Instruction Multiple Data (SIMD) architectures and Coarse-Grained Reconfigurable Architectures (CGRAs), is driving significant advancements in Artificial Intelligence and Machine Learning (AI/ML) applications. These applications frequently involve data streaming operations composed of numerous vector calculations that are inherently amenable to parallelization. However, despite considerable progress in hardware accelerator design, the potential of these accelerators remains constrained by conventional instruction set architectures (ISAs). Traditional ISAs, primarily designed for microprocessors and accelerators, emphasize computation while often neglecting instruction composability and inter-instruction cooperation. This limitation results in rigid ISAs that are difficult to extend and that suffer from large control overhead in their hardware implementations. To address this, we present a novel composable instruction set (CIS) architecture, designed with both spatial and temporal composability, making it well-suited for data streaming applications. The proposed CIS uses a small instruction set, yet efficiently implements the complex, multi-level loop structures essential for accelerating data streaming workloads. Furthermore, CIS adopts a resource-centric approach that makes it straightforward to extend by integrating new hardware resources, enabling the creation of custom, heterogeneous computing platforms. A performance comparison between the proposed CIS and other state-of-the-art ISAs demonstrates that a CIS-based architecture significantly outperforms existing solutions, achieving near-optimal processing element (PE) utilization.
