
MX: Enhancing RISC-V's Vector ISA for Ultra-Low Overhead, Energy-Efficient Matrix Multiplication

Matteo Perotti, Yichao Zhang, Matheus Cavalcante, Enis Mustafa, Luca Benini

TL;DR

This paper tackles the energy-inefficient execution of dense matrix multiplication (MatMul) on embedded, low-power RISC-V processors. It introduces MX, a lightweight, non-intrusive extension to the open-source RISC-V Vector (RVV) ISA that reuses the existing vector register file and functional units to execute tiled matrix operations, adding only a compact near-FPU tile buffer and a broadcast mechanism. In a 12-nm technology node, MX delivers up to $56\%$ higher MatMul performance and up to $25\%$ better energy efficiency, with less than $3\%$ area overhead and no clock frequency penalty, as demonstrated on Dual-Core and 64-Core MemPool clusters. The work shows a practical path to energy-efficient matrix acceleration without dedicated matrix units, relying on software-transparent tiling and data-reuse optimizations.

Abstract

Dense Matrix Multiplication (MatMul) is arguably one of the most ubiquitous compute-intensive kernels, spanning linear algebra, DSP, graphics, and machine learning applications. Thus, MatMul optimization is crucial not only in high-performance processors but also in embedded low-power platforms. Several Instruction Set Architectures (ISAs) have recently included matrix extensions to improve MatMul performance and efficiency at the cost of added matrix register files and units. In this paper, we propose Matrix eXtension (MX), a lightweight approach that builds upon the open-source RISC-V Vector (RVV) ISA to boost MatMul energy efficiency. Instead of adding expensive dedicated hardware, MX uses the pre-existing vector register file and functional units to create a hybrid vector/matrix engine at a negligible area cost (< 3%), which comes from a compact near-FPU tile buffer for higher data reuse, and no clock frequency overhead. We implement MX on a compact and highly energy-optimized RVV processor and evaluate it in both a Dual- and 64-Core cluster in a 12-nm technology node. MX boosts the Dual-Core's energy efficiency by 10% for a double-precision 64x64x64 matrix multiplication with the same FPU utilization (~97%) and by 25% on the 64-Core cluster for the same benchmark on 32-bit data, with a 56% performance gain.
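The data-reuse idea behind a tile buffer can be illustrated with a plain tiled MatMul. The sketch below is only an illustration under stated assumptions, not the MX instruction semantics or the authors' kernel: the function name `tiled_matmul`, the tile parameters `mt`, `nt`, `kt`, and the list `acc` standing in for a near-FPU tile buffer are all hypothetical. It shows how each output tile is accumulated locally across the whole inner dimension and written back once, and how one element of A is reused across a row of the tile.

```python
def tiled_matmul(A, B, M, N, K, mt=8, nt=4, kt=4):
    """Compute C = A @ B (A is MxK, B is KxN) one mt x nt output tile at a time.

    Hypothetical sketch: `acc` plays the role of a small local tile buffer,
    so each output tile is written back to C only once.
    """
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, mt):
        for j0 in range(0, N, nt):
            mi = min(mt, M - i0)  # tile height (handles ragged edges)
            nj = min(nt, N - j0)  # tile width (handles ragged edges)
            acc = [[0.0] * nj for _ in range(mi)]  # local accumulator tile
            for k0 in range(0, K, kt):
                for k in range(k0, min(k0 + kt, K)):
                    for i in range(mi):
                        a = A[i0 + i][k]  # one A element reused across the tile row
                        for j in range(nj):
                            acc[i][j] += a * B[k][j0 + j]
            # Single write-back of the finished tile.
            for i in range(mi):
                for j in range(nj):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C
```

Calling `tiled_matmul(A, B, M, N, K)` with different tile sizes should give the same result as a naive triple loop; only the order of accesses, and therefore the amount of reuse per loaded element, changes.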

Paper Structure

This paper contains 27 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The tiling problem over a memory hierarchy composed of three levels, ending with the processing elements.
  • Figure 2: Spatz's , , and with architectural schematic.
  • Figure 3: Power breakdown for the Dual-Core (left) and 64-Core (right) clusters executing MatMul. Dual-Core: at TT@1 GHz, executing non-MX (4 vectors, length 32) and MX-ready algorithms ($m'=8, n'=4, k'=4, B=4$). 64-Core: at TT@910 MHz, executing non-MX (8 vectors, length 32) and MX-ready algorithms ($m'=8, n'=4, k'=8, B=8$).