MINISA: Minimal Instruction Set Architecture for Next-gen Reconfigurable Inference Accelerator

Jianming Tong, Devansh Jain, Yujie Li, Charith Mendis, Tushar Krishna

Abstract

Modern reconfigurable AI accelerators rely on rich mapping and data-layout flexibility to sustain high utilization across matrix multiplication, convolution, and emerging applications beyond AI. However, exposing this flexibility through fine-grained micro-control incurs a prohibitive control overhead from fetching configuration bits from off-chip memory. This paper presents MINISA, a minimal instruction set that programs a reconfigurable accelerator at the granularity of Virtual Neurons (VNs), which is both the coarsest control granularity that retains the hardware's flexibility and the finest granularity that avoids unnecessary control costs. First, we introduce FEATHER+, a modest refinement of FEATHER that eliminates the redundant on-chip replication needed for runtime dataflow/layout co-switching and supports dynamic cases where input and weight data are unavailable before execution for offline layout manipulation. MINISA then abstracts control of FEATHER+ into three layout-setting instructions for input, weight, and output VNs and a single mapping instruction for setting the dataflow. This reduces the control and instruction footprint while preserving the legal mapping and layout space supported by FEATHER+. Our results show that, across 50 GEMM workloads spanning AI (GPT-oss), FHE, and ZKP, MINISA reduces geometric-mean off-chip instruction traffic by factors ranging from 35x to (4x10^5)x over the evaluated sizes. This eliminates instruction-fetch stalls that consume 96.9% of micro-instruction cycles, yielding up to 31.6x end-to-end speedup for 16x256 FEATHER+. Our code: https://github.com/maeri-project/FEATHER/tree/main/minisa.

Paper Structure (79 sections, 11 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: Workload Illustration. Convolution is converted to MatMul via im2col. We use color coding: blue/green/purple/red for Input($I$)/Weight($W$)/Partial-sum(psum, $P$)/Outputs($O$).
  • Figure 2: Programmer view of FEATHER+, the reconfigurable accelerator capable of co-switching both mapping and layout.
  • Figure 3: Execute* Field definition for $AH\!\times\! AW$ NEST.
  • Figure 4: ExecuteMapping examples for $4\!\times\!4$ NEST.
  • Figure 5: MINISA specifications for layout, load/store, and dataflow. $D$ refers to the depth of stationary / streaming buffer.
  • ...and 8 more figures