Evaluation of Domain-Specific Architectures for General-Purpose Applications in Apple Silicon
Álvaro Corrochano López, Carlos García Sánchez
TL;DR
This paper evaluates whether the Apple Neural Engine (ANE) can extend beyond ML inference to general-purpose HPC workloads on Apple Silicon by benchmarking classic HPC kernels (GEMM, Jacobi, Multigrid) on the M1 and M4 Pro. The authors port these kernels, as well as AI models (YOLOv3/v11), using CoreMLTools and compare CPU, GPU, and ANE performance and energy consumption. The ANE reaches up to 3.8 TFLOPS on the M4 Pro for GEMM and shows strong energy efficiency on several workloads. The ANE often delivers the best energy efficiency and competitive compute for small-scale tasks, while the GPU and CPU scale more robustly to larger problem sizes, where memory limits constrain the ANE. The work demonstrates both the potential and the limitations of the ANE for general-purpose HPC on Apple Silicon, and suggests hybrid strategies (e.g., splitting Multigrid workloads across compute units) as future directions for maximizing performance and energy savings in heterogeneous SoCs.
Abstract
The rise of AI and its growing computational demands have driven the integration of domain-specific accelerators (such as GPUs, TPUs, and NPUs) across the entire computing infrastructure. Following the precedent set by GPGPU, which popularized GPUs for general-purpose tasks, this research asks whether the same phenomenon can be replicated with specialized accelerators such as NPUs in new contexts. This paper evaluates the potential of the Apple Neural Engine (ANE), designed for high energy efficiency in machine learning workloads, in the context of general-purpose HPC applications. We evaluate the performance and energy consumption of classic HPC algorithms such as GEMM, Jacobi, and Multigrid methods on Apple's ANE across the M1 and the latest M4 architectures. Results confirm that, when algorithms are properly adapted, the ANE achieves competitive performance (up to 3.8 TFLOPS on the M4 Pro, comparable to the GPU's 4.7 TFLOPS on the same SoC for the GEMM operation) while demonstrating significantly superior energy efficiency (e.g., GEMM consumes 5.2 W on the ANE versus 24 W on its GPU counterpart in M4 architectures).
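To make the efficiency claim concrete, the GEMM figures quoted in the abstract can be combined into a throughput-per-watt estimate. The sketch below is only illustrative arithmetic: it assumes the peak throughput and the power draw reported for each unit refer to the same GEMM configuration, which the abstract does not state explicitly. The helper name `tflops_per_watt` is hypothetical and not taken from the paper.

```python
# Illustrative arithmetic on the M4 Pro GEMM figures quoted in the abstract.
# Assumption: each throughput/power pair comes from the same GEMM run.

def tflops_per_watt(tflops: float, watts: float) -> float:
    """Return throughput per unit power (hypothetical helper)."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return tflops / watts

# Figures from the abstract (M4 Pro, GEMM).
ANE_TFLOPS, ANE_WATTS = 3.8, 5.2
GPU_TFLOPS, GPU_WATTS = 4.7, 24.0

ane_eff = tflops_per_watt(ANE_TFLOPS, ANE_WATTS)
gpu_eff = tflops_per_watt(GPU_TFLOPS, GPU_WATTS)
efficiency_ratio = ane_eff / gpu_eff      # ANE advantage in TFLOPS/W
throughput_ratio = ANE_TFLOPS / GPU_TFLOPS  # ANE share of GPU throughput

print(f"ANE: {ane_eff:.2f} TFLOPS/W")
print(f"GPU: {gpu_eff:.2f} TFLOPS/W")
print(f"ANE/GPU efficiency ratio: {efficiency_ratio:.1f}x")
print(f"ANE/GPU throughput ratio: {throughput_ratio:.2f}")
```

Under that assumption, the ANE delivers roughly 0.73 TFLOPS/W against roughly 0.20 TFLOPS/W for the GPU, an advantage of about 3.7x in energy efficiency at about 81% of the GPU's throughput.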
