Empowering Vector Architectures for ML: The CAMP Architecture for Matrix Multiplication

Mohammadreza Esmali Nojehdeh, Hossein Mokhtarnia, Julian Pavon Rivera, Narcis Rodas Quiroga, Roger Figueras Bagué, Enrico Reggiani, Miquel Moreto, Osman Unsal, Adrian Cristal, Eduard Ayguade

TL;DR

The paper addresses inefficient quantized matrix multiplication on vector architectures by introducing the Cartesian Accumulative Matrix Pipeline (CAMP), a micro-architecture built around a hybrid multiplier and an outer-product computation model. By extending ARM SVE and an edge RISC-V SIMD design with a dedicated CAMP instruction and supporting hardware, CAMP achieves speedups of up to $17\times$ over the ARM A64FX core and up to $23\times$ over a RISC-V edge SoC, along with energy savings of over $80\%$, for 8-bit and 4-bit GeMM across LLMs and CNNs. The area overhead is small: $1\%$ on A64FX-class cores and $4\%$ on RISC-V edge SoCs. The design relies on three elements: a divide-and-conquer hybrid multiplier, intra-lane and inter-lane accumulators, and an outer-product computation model that reduces data movement and instruction count compared to traditional BLAS-based software on vector units. The work presents a practical, scalable path to native support for low-precision workloads in future vector extensions, with implications for both edge and datacenter AI accelerators.
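To make the outer-product computation model concrete, here is a minimal sketch in plain Python. It is not the CAMP instruction or the authors' kernel; the function name `gemm_outer_product` and the list-of-lists matrix layout are assumptions made for illustration. It only shows the arithmetic: the product is accumulated as a sum of rank-1 updates, so each step loads one column of A and one row of B (m + n values) and performs m × n multiply-accumulates.

```python
def gemm_outer_product(a, b):
    """Compute C = A x B as a sum of rank-1 (outer-product) updates.

    a is an m x k matrix and b is a k x n matrix, both given as lists of rows.
    Illustrative only: this models the arithmetic, not any hardware pipeline.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b must match")

    # Accumulator tile; Python ints are unbounded, standing in for wide accumulators.
    c = [[0] * n for _ in range(m)]

    for p in range(k):
        col = [a[i][p] for i in range(m)]  # one column of A: m loaded values
        row = b[p]                         # one row of B: n loaded values
        # Rank-1 update: every loaded value is reused across a whole row or column.
        for i in range(m):
            for j in range(n):
                c[i][j] += col[i] * row[j]
    return c


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(gemm_outer_product(a, b))
```

In this formulation the accumulator tile stays in place while operands stream past it, which is the property that lets an outer-product design cut data movement relative to a dot-product loop that reloads operands for each output element.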

Abstract

This study presents the Cartesian Accumulative Matrix Pipeline (CAMP) architecture, a novel approach designed to enhance matrix multiplication in Vector Architectures (VAs) and Single Instruction Multiple Data (SIMD) units. CAMP improves the processing efficiency of Quantized Neural Networks (QNNs). Matrix multiplication is a cornerstone of machine learning applications, and its quantized versions are increasingly popular for more efficient operations. Unfortunately, existing VAs and SIMD-support units struggle to efficiently handle these quantized formats. In this work, we propose CAMP, a simple yet effective architecture that leverages a hybrid multiplier. The CAMP architecture significantly advances the performance of vector architectures in handling quantized data, enabling more efficient execution of matrix multiplication across various platforms, specifically targeting the ARMv8 Scalable Vector Extension (SVE) and edge RISC-V SIMD-based architectures. In addition to increasing throughput, CAMP's architectural design also contributes to energy efficiency, making it an effective solution for low-power applications. Evaluations on a range of Large Language Models (LLMs) and Convolutional Neural Networks (CNNs) demonstrate that matrix multiplication operations using the proposed micro-architecture achieve up to $17\times$ and $23\times$ performance improvements compared to their respective baselines, the ARM A64FX core and a RISC-V-based edge System-on-Chip (SoC). Furthermore, synthesis and place-and-route (PnR) of the CAMP micro-architecture using Synopsys tools (targeting ARM TSMC 7nm for A64FX and GlobalFoundries 22nm for the RISC-V SoC) add only $1\%$ and $4\%$ area overhead, respectively, compared to the baseline designs.

Paper Structure

This paper contains 21 sections, 2 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Cache Miss Rate (CMR) for matrix multiplication using naive and ulmBLAS methods on square matrices and ResNet layers, evaluated on the A64FX core.
  • Figure 2: C++ Code for matrix multiplication using the GotoBLAS micro-kernel.
  • Figure 3: The GotoBLAS algorithm for matrix-matrix multiplication [van2017implementing].
  • Figure 4: Functional unit busy rate by method and number of operations.
  • Figure 5: Structure of hybrid multiplier for 4n-bit multiplication using n-bit building blocks.
  • ...and 13 more figures
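
The caption of Figure 5 describes a hybrid multiplier that performs a 4n-bit multiplication using n-bit building blocks. The sketch below illustrates only the divide-and-conquer identity behind such a design, for unsigned operands. The helper names `split_chunks` and `block_multiply` are hypothetical, and nothing here reflects how CAMP actually wires, shares, or schedules its blocks, which this summary does not describe.

```python
def split_chunks(x, n_bits, count):
    """Split an unsigned integer into `count` chunks of `n_bits` bits, least significant first."""
    mask = (1 << n_bits) - 1
    return [(x >> (i * n_bits)) & mask for i in range(count)]


def block_multiply(a, b, n_bits=4, count=4):
    """Multiply two unsigned (count * n_bits)-bit integers using only n_bits x n_bits products.

    With the defaults this is a 16-bit multiply assembled from sixteen 4-bit x 4-bit
    partial products, i.e. the 4n-bit case with n = 4. Unsigned operands only.
    """
    limit = 1 << (n_bits * count)
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must fit in count * n_bits unsigned bits")

    a_chunks = split_chunks(a, n_bits, count)
    b_chunks = split_chunks(b, n_bits, count)

    result = 0
    for i, a_i in enumerate(a_chunks):
        for j, b_j in enumerate(b_chunks):
            partial = a_i * b_j                       # one n-bit building block
            result += partial << ((i + j) * n_bits)   # weight by chunk positions
    return result


if __name__ == "__main__":
    print(block_multiply(0xBEEF, 0x1234) == 0xBEEF * 0x1234)
```

Calling `block_multiply(x, y, n_bits=4, count=2)` gives the 8-bit case from four 4-bit products, which shows why the same narrow blocks can serve more than one operand width.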