Multi-Dimensional Vector ISA Extension for Mobile In-Cache Computing
Alireza Khadem, Daichi Fujiki, Hilbert Chen, Yufeng Gu, Nishil Talati, Scott Mahlke, Reetuparna Das
TL;DR
This work addresses the underutilization of in-cache vector hardware on mobile CPUs by extending long-vector ISA design to support multi-dimensional data layouts and memory accesses. The proposed MVE abstracts cache geometry and data layout, provides multi-dimensional strided and random accesses with dimension-level masking, and couples a compute-capable cache architecture to a scalar core. On data-parallel mobile workloads, MVE delivers an average 2.9× speedup and 8.8× energy reduction compared to the SIMD units of a commercial mobile processor, at 3.6% area overhead; it also outperforms 1D RVV and compares favorably with a mobile GPU on fine-grained data-parallel workloads. Together, these results suggest that a general-purpose, multi-dimensional long-vector ISA paired with a closely integrated cache design is a practical way to raise mobile in-cache computing performance.
Abstract
In-cache computing technology transforms existing caches into long-vector compute units and offers a low-cost alternative to building expensive vector engines for mobile CPUs. Unfortunately, existing long-vector Instruction Set Architecture (ISA) extensions, such as the RISC-V Vector Extension (RVV) and the Arm Scalable Vector Extension (SVE), provide only one-dimensional strided and random memory accesses. While this is sufficient for typical vector engines, it fails to effectively utilize the large Single Instruction, Multiple Data (SIMD) widths of in-cache vector engines, because mobile data-parallel kernels expose limited parallelism across a single dimension. Based on our analysis of mobile vector kernels, we introduce a long-vector Multi-dimensional Vector ISA Extension (MVE) for mobile in-cache computing. MVE achieves high SIMD resource utilization and enables flexible programming by abstracting cache geometry and data layout. The proposed ISA features multi-dimensional strided and random memory accesses and efficient dimension-level masked execution to encode parallelism across multiple dimensions. Using a wide range of data-parallel mobile workloads, we demonstrate that MVE offers a 2.9x speedup and an 8.8x energy reduction, on average, compared to the SIMD units of a commercial mobile processor, at an area overhead of 3.6%.
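To make the idea of a multi-dimensional strided access with dimension-level masking more concrete, the following is a minimal Python sketch of how such an access could map to element addresses. It is not the MVE instruction encoding or semantics from the paper: the function names, the innermost-first ordering of dimensions, the boolean mask representation, and the 256-lane SIMD width are all assumptions chosen for illustration.

```python
from itertools import product


def md_strided_addresses(base, dims, strides, masks=None):
    """Enumerate element addresses of a multi-dimensional strided access.

    Hypothetical model (not the MVE specification):
      dims    -- length of each dimension, innermost first
      strides -- element stride of each dimension, innermost first
      masks   -- optional per-dimension lists of booleans; an index that is
                 masked off in one dimension disables every lane that uses
                 that index, regardless of the other dimensions
    """
    if masks is None:
        masks = [[True] * n for n in dims]
    addresses = []
    # product() varies its last argument fastest, so feed the dimensions
    # outermost first and flip each index tuple back to innermost first.
    for outer_first in product(*(range(n) for n in reversed(dims))):
        idx = outer_first[::-1]
        if all(masks[d][i] for d, i in enumerate(idx)):
            addresses.append(base + sum(i * s for i, s in zip(idx, strides)))
    return addresses


def utilization(active_lanes, simd_width):
    """Fraction of SIMD lanes doing useful work for one vector instruction."""
    return min(active_lanes, simd_width) / simd_width


# A 4-wide, 3-tall tile inside a row-major image that is 16 elements wide.
tile = md_strided_addresses(base=0, dims=(4, 3), strides=(1, 16))

# Same tile with the middle row disabled by a mask on the second dimension.
masked = md_strided_addresses(
    base=0, dims=(4, 3), strides=(1, 16),
    masks=[[True] * 4, [True, False, True]],
)

# Assumed 256-lane engine: one row per instruction versus the whole tile.
one_dim = utilization(4, 256)
two_dim = utilization(len(tile), 256)
```

In this toy setting, a one-dimensional access covers only the 4 elements of a single row, while the two-dimensional access covers all 12 tile elements in one instruction, which is the kind of utilization gap the abstract attributes to limited parallelism along a single dimension.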
