Accelerating Sparse Ternary GEMM for Quantized ML on Apple Silicon

Baraq Lipshitz, Alessio Melone, Charalampos Maraziaris, Muhammed Bilal

TL;DR

This paper accelerates Sparse Ternary GEMM, a core operation when neural network weights are quantized to ternary values, on Apple Silicon. It introduces architecture-aware kernels for M-series CPUs, including a novel blocked and interleaved sparse data format that improves memory locality and ILP, together with NEON-based SIMD for data-level parallelism. The authors present both scalar and vectorized implementations. The scalar kernel achieves up to a $5.98\times$ speedup over a TCSC baseline at 50% sparsity (where "sparsity" denotes the fraction of nonzero ternary values), reaching ≈50.2% of theoretical peak; the vectorized kernel achieves up to a $5.59\times$ speedup at 25% sparsity. Both remain stable across sparsity levels. The authors observe that scalar code can outperform vectorized code on Apple Silicon due to the lack of gather/scatter support, and demonstrate practical gains for quantized ML workloads on Apple devices.

Abstract

Sparse Ternary General Matrix-Matrix Multiplication (GEMM) remains under-optimized in existing libraries for Apple Silicon CPUs. We present a Sparse Ternary GEMM kernel optimized specifically for Apple's M-series processors. We propose a set of architecture-aware optimizations, including a novel blocked and interleaved sparse data format to improve memory locality, strategies to increase Instruction-Level Parallelism (ILP), and NEON-based Single Instruction Multiple Data (SIMD) vectorization to exploit data-level parallelism. Our scalar implementation achieves up to a 5.98x performance increase over a traditional Ternary Compressed Sparse Column (TCSC) baseline for large matrices with 50% ternary nonzero values (sparsity), reaching up to 50.2% of the processor's theoretical peak performance. Our vectorized implementation delivers up to a 5.59x performance increase for large matrices with 25% sparsity. Both implementations remain stable across varying sparsity levels.
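To make the baseline concrete, the following is a minimal sketch of a TCSC-style ternary GEMM in plain Python. It is not the authors' kernel: the array names (`pos_ptr`, `pos_idx`, `neg_ptr`, `neg_idx`), the helper names, and the orientation (dense `X` of shape M×K times ternary `W` of shape K×N) are assumptions made for illustration, and any bias term is omitted. It shows only the general idea that storing the row indices of the +1 and -1 entries per column lets the product be computed with additions and subtractions alone.

```python
def to_tcsc(W):
    """Compress a ternary matrix W (K x N, entries in {-1, 0, +1}) column by column.

    Hypothetical layout: for each column, the row indices of +1 entries and
    of -1 entries are stored separately, each with its own column-pointer array.
    """
    K, N = len(W), len(W[0])
    pos_ptr, neg_ptr = [0], [0]
    pos_idx, neg_idx = [], []
    for j in range(N):
        for k in range(K):
            if W[k][j] == 1:
                pos_idx.append(k)
            elif W[k][j] == -1:
                neg_idx.append(k)
        pos_ptr.append(len(pos_idx))
        neg_ptr.append(len(neg_idx))
    return pos_ptr, pos_idx, neg_ptr, neg_idx


def ternary_gemm(X, tcsc):
    """Compute Y = X @ W without multiplications, given W in the layout above."""
    pos_ptr, pos_idx, neg_ptr, neg_idx = tcsc
    M, N = len(X), len(pos_ptr) - 1
    Y = [[0.0] * N for _ in range(M)]
    for i in range(M):
        row = X[i]
        for j in range(N):
            acc = 0.0
            # Add the inputs that meet a +1 weight in column j.
            for p in range(pos_ptr[j], pos_ptr[j + 1]):
                acc += row[pos_idx[p]]
            # Subtract the inputs that meet a -1 weight in column j.
            for p in range(neg_ptr[j], neg_ptr[j + 1]):
                acc -= row[neg_idx[p]]
            Y[i][j] = acc
    return Y


W = [[1, 0, -1],
     [0, -1, 1],
     [-1, 1, 0],
     [1, 1, 0]]
X = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -1.0, 2.0, 0.0]]
Y = ternary_gemm(X, to_tcsc(W))
```

The inner loops read `row[pos_idx[p]]` at data-dependent positions. This indirect access pattern is what a SIMD gather instruction would otherwise serve, which is consistent with the observation that the lack of gather/scatter support limits the vectorized path.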

Paper Structure

This paper contains 5 sections, 2 equations, and 11 figures.

Figures (11)

  • Figure 1: Example of TCSC format
  • Figure 2: For any $K\leq4096$, we obtain the same optimal unrolling factor.
  • Figure 3: As the size of the row increases, cache misses occur when the working set is 4 rows of Y and X.
  • Figure 4: For the largest K, only one row of X and Y fits in cache.
  • Figure 5: Example of BlockedTCSC format for B=2
  • ...and 6 more figures