Multi-Dimensional Vector ISA Extension for Mobile In-Cache Computing

Alireza Khadem, Daichi Fujiki, Hilbert Chen, Yufeng Gu, Nishil Talati, Scott Mahlke, Reetuparna Das

TL;DR

This work addresses the underutilization of in-cache vector hardware on mobile CPUs by extending long-vector ISA design to support multi-dimensional data layouts and memory accesses within the cache. The proposed MVE extension abstracts cache geometry, provides multi-dimensional strided and random accesses with dimension-level masking, and couples a compute-capable cache architecture to a scalar core. On average, MVE delivers a 2.9× speedup and an 8.8× energy reduction over the SIMD units of a commercial mobile processor, at an area overhead of 3.6%; it also outperforms 1D RVV and compares favorably with a mobile GPU on fine-grained data-parallel workloads. Together, these results show that a general-purpose, multi-dimensional long-vector ISA with a closely integrated cache design is a practical way to raise mobile in-cache computing performance.

Abstract

In-cache computing technology transforms existing caches into long-vector compute units and offers low-cost alternatives to building expensive vector engines for mobile CPUs. Unfortunately, existing long-vector Instruction Set Architecture (ISA) extensions, such as RISC-V Vector Extension (RVV) and Arm Scalable Vector Extension (SVE), provide only one-dimensional strided and random memory accesses. While this is sufficient for typical vector engines, it fails to effectively utilize the large Single Instruction, Multiple Data (SIMD) widths of in-cache vector engines. This is because mobile data-parallel kernels expose limited parallelism across a single dimension. Based on our analysis of mobile vector kernels, we introduce a long-vector Multi-dimensional Vector ISA Extension (MVE) for mobile in-cache computing. MVE achieves high SIMD resource utilization and enables flexible programming by abstracting cache geometry and data layout. The proposed ISA features multi-dimensional strided and random memory accesses and efficient dimension-level masked execution to encode parallelism across multiple dimensions. Using a wide range of data-parallel mobile workloads, we demonstrate that MVE offers significant performance and energy reduction benefits of 2.9x and 8.8x, on average, compared to the SIMD units of a commercial mobile processor, at an area overhead of 3.6%.
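
The utilization argument in the abstract can be made concrete with a small worked calculation. The sketch below is illustrative only and is not taken from the paper: the lane count (`SIMD_LANES`), the kernel shape (`KERNEL_SHAPE`), and the helper `lane_utilization` are all hypothetical values and names chosen for the example. It compares how many SIMD lanes one vector instruction keeps busy when only the innermost loop is vectorized against the case where all three loop dimensions are encoded together.

```python
from math import prod

def lane_utilization(dims_vectorized, simd_lanes):
    """Fraction of SIMD lanes filled by one vector instruction.

    dims_vectorized: trip counts of the loop dimensions encoded in the
    instruction (hypothetical model, not the paper's definition).
    """
    elements = prod(dims_vectorized)
    return min(elements, simd_lanes) / simd_lanes

# Assumed values for illustration only.
SIMD_LANES = 8192             # hypothetical in-cache SIMD width
KERNEL_SHAPE = (32, 16, 16)   # hypothetical trip counts, innermost first

# 1D vectorization: only the innermost dimension fills lanes.
util_1d = lane_utilization(KERNEL_SHAPE[:1], SIMD_LANES)

# Multi-dimensional vectorization: all three dimensions fill lanes.
util_3d = lane_utilization(KERNEL_SHAPE, SIMD_LANES)

print(f"1D utilization: {util_1d:.4%}")
print(f"3D utilization: {util_3d:.4%}")
```

Under these assumed numbers, a short innermost loop leaves almost every lane idle, while encoding all three dimensions fills the register; the real gap depends on the kernel and the cache configuration.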

Paper Structure (34 sections, 2 equations, 14 figures, 6 tables, 1 algorithm)

This paper contains 34 sections, 2 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: (a) Mobile core with in-cache computing enabled for half of the L2 cache. (b) In-SRAM computing activates two word-lines of an SRAM array using an extra row decoder. (c) Bit-Serial (blue), Bit-Hybrid, and Bit-Parallel (blue + orange) modifications to the bitline peripheral.
  • Figure 2: (a) MVE operates on N long-vector in-cache registers. (b) In-cache data elements and SIMD lanes use the vertical data layout of bit-lines. (c) An in-cache physical register spans all compute-capable SRAM arrays.
  • Figure 3: Strided memory access example of Intrapicture Prediction kernel: loading from (a) 2D memory layout to (b) 3D logical registers, mapped to (c) the SIMD lanes of flattened-out physical registers by MVE controller.
  • Figure 4: Random memory access of h2v2 Upsample kernel: loading from (a) random row pointers to (b) 4D logical registers. (c) shows the SIMD lanes of the flattened-out physical registers.
  • Figure 5: MVE Controller maps multi-dimensional logical registers to 1D Physical SIMD Registers. Efficient dimension-level masked execution masks off leaves under a node in the highest dimension of the tree (iterations of the outer-most loop).
  • ...and 9 more figures
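
The captions of Figures 3 and 5 describe multi-dimensional logical registers being flattened onto 1D physical SIMD lanes, with masking applied to whole iterations of the outermost loop. The following is a minimal sketch of that idea under stated assumptions, not the paper's controller: the function `flatten_strided_load`, its parameters, and the example lengths and strides are hypothetical. It generates one address per lane for a strided access, with the innermost dimension varying fastest, and marks every lane under a masked outermost index as inactive (`None`).

```python
from itertools import product

def flatten_strided_load(base, lengths, strides, masked_outer=frozenset()):
    """Map a multi-dimensional strided access onto a flat list of lanes.

    lengths, strides: per-dimension values, innermost dimension first.
    masked_outer: indices of the outermost dimension to mask off.
    Returns one entry per lane: an address, or None for a masked lane.
    """
    lanes = []
    # product() varies its last range fastest, so pass dimensions
    # outermost-first to make the innermost dimension change fastest.
    for idx in product(*(range(n) for n in reversed(lengths))):
        idx = idx[::-1]  # restore innermost-first ordering
        if idx[-1] in masked_outer:
            lanes.append(None)  # whole outer iteration is masked
        else:
            lanes.append(base + sum(i * s for i, s in zip(idx, strides)))
    return lanes

# Hypothetical 3D access: 4 x 2 x 3 elements, element strides 1, 16, 64,
# with the second iteration of the outermost dimension masked off.
lanes = flatten_strided_load(
    base=0, lengths=(4, 2, 3), strides=(1, 16, 64), masked_outer={1}
)
print(lanes)
```

Masking at the granularity of the outermost dimension needs only one mask bit per outer iteration in this sketch, rather than one bit per lane; whether the hardware encodes it this way is not stated in the text above.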