CIS: Composable Instruction Set for Data Streaming Applications

Yu Yang, Jordi Altayó González, Paul Delestrac, Ahmed Hemani

TL;DR

The paper addresses the inefficiency of conventional computation-centric ISAs on data-streaming workloads by introducing the Composable Instruction Set (CIS), which uses spatial and temporal composability to map static loops onto distributed hardware resources. It formalizes a hardware template consisting of a sequencer and resource slots, and defines two temporal operators, REPETITION and TRANSITION, that allow nested loop structures to be composed from simple resource-centric instructions. A toy example, together with a discussion of the architecture and compiler implications, shows how CIS lets cooperative micro-threads accelerate data streaming; one illustration is a 64-element vector operation implemented with four instructions. Experimental results on a DRRA platform show that CIS achieves substantially higher effective PE utilization than traditional micro-architectures and parallel designs, approaching the theoretical maximum for non-trivial workloads. These results highlight CIS's potential for efficient, extensible data-streaming accelerators on heterogeneous hardware.

Abstract

The enhanced efficiency of hardware accelerators, including Single Instruction Multiple Data (SIMD) architectures and Coarse-Grained Reconfigurable Architectures (CGRAs), is driving significant advancements in Artificial Intelligence and Machine Learning (AI/ML) applications. These applications frequently involve data streaming operations composed of numerous vector calculations that are inherently amenable to parallelization. However, despite considerable progress in hardware accelerator design, the potential of these accelerators remains constrained by conventional instruction set architectures (ISAs). Traditional ISAs, primarily designed for microprocessors and accelerators, emphasize computation while often neglecting instruction composability and inter-instruction cooperation. This limitation results in rigid ISAs that are difficult to extend and that suffer from large control overhead in their hardware implementations. To address this, we present a novel composable instruction set (CIS) architecture, designed with both spatial and temporal composability, making it well-suited for data streaming applications. The proposed CIS uses a small instruction set, yet efficiently implements the complex, multi-level loop structures essential for accelerating data streaming workloads. Furthermore, CIS adopts a resource-centric approach that facilitates straightforward extension through the integration of new hardware resources, enabling the creation of custom, heterogeneous computing platforms. Our results comparing the proposed CIS with other state-of-the-art ISAs demonstrate that a CIS-based architecture significantly outperforms existing solutions, achieving near-optimal processing element (PE) utilization.
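
To make the idea of composing multi-level loop structures from a few simple instructions more concrete, the following is a minimal Python sketch, not the CIS encoding itself. The `Repetition` class, its `iterations` and `step` fields, and the `address_stream` helper are hypothetical names introduced only for illustration; the actual instruction fields and semantics are defined in the paper. The sketch models only repetition-style nesting and does not attempt to model the TRANSITION operator.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Repetition:
    """One hypothetical loop level: run `iterations` times, advancing by `step`."""
    iterations: int
    step: int


def address_stream(base: int, levels: List[Repetition]) -> Iterator[int]:
    """Yield the addresses visited by a loop nest; levels[0] is the outermost loop."""
    if not levels:
        # No loop levels left: emit the single address reached so far.
        yield base
        return
    outer, inner = levels[0], levels[1:]
    for i in range(outer.iterations):
        # Each outer iteration shifts the base seen by the inner levels.
        yield from address_stream(base + i * outer.step, inner)


# Assumed example: a 64-element vector read as 4 rows of 16 contiguous elements.
addresses = list(address_stream(0, [Repetition(4, 16), Repetition(16, 1)]))
```

Under these assumptions, two stacked repetition levels describe the whole 64-element access pattern without per-element control instructions, which is the kind of control-overhead reduction the abstract attributes to temporal composability.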

Paper Structure (13 sections, 1 equation, 5 figures, 2 tables)

Figures (5)

  • Figure 1: The hardware architecture is a template that consists of a sequencer and multiple resource slots. This architecture instance is used in later examples. Colored slots are occupied by resources; gray slots are unused. A resource can occupy multiple contiguous slots and thereby access multiple ports.
  • Figure 2: The decomposition of a program into operations
  • Figure 3: The effective PE utilization comparison: DRRA vs RISC-V
  • Figure 4: The effective PE utilization comparison: DRRA vs ARA-2, OpenEdge CGRA and TI C7000 VLIW
  • Figure 5: The effect of parallelism on both DRRA and CGRA