Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads

Rachid Karami; Sheng-Chun Kao; Hyoukjun Kwon

TL;DR

This work addresses the shifted performance horizon in ML inference where GEMM-optimized kernels no longer fully dominate latency. It introduces NonGEMM Bench, an open-source benchmark, to systematically profile non-GEMM operators across 17 widely used CV and NLP workloads on workstation and data-center platforms, with/without GPUs and under common deployment flows and quantization. The study finds that non-GEMM operators account for substantial and variable portions of end-to-end latency (11.3%–73.6% on average), rise notably after GEMM acceleration, and can be exacerbated by quantization due to dequantization/requantization overhead, while operator fusion only partially mitigates the bottleneck. The results argue for broad, non-GEMM-oriented optimization strategies and provide a practical benchmarking tool to drive future hardware and software innovations in this regime.

Abstract

Among ML operators today, GEneral Matrix Multiplication (GEMM)-based operators are known to be the key operators that form the backbone of ML models. Because their computational overhead dominates overall execution time (e.g., 42.8% to 96.6% in our results), GEMM operators have been the prime optimization targets for fast ML inference. This led to the advanced GPUs and accelerators available today, which provide a significant boost in GEMM performance compared to CPUs, in line with the lesson of Amdahl's law. However, accelerating GEMM has significantly shifted the Amdahl's law landscape for ML inference: because GEMM execution time has decreased, the relative execution time of non-GEMM operators is now significant. Although the importance of non-GEMM performance is increasing, we have little knowledge about the non-GEMM performance horizon on the latest hardware platforms and models. Therefore, to guide non-GEMM-oriented optimizations, we conduct a thorough performance analysis of 17 widely adopted ML models from Hugging Face and Torchvision on workstation and data center platforms with and without GPUs. We discover that the non-GEMM performance bottleneck is a considerable issue across all the platforms and models, accounting for 11.3% to 73.6% of total latency, on average. The challenge is significantly aggravated when we apply quantization, a common model compression technique, because of the boosted GEMM performance and the extra non-GEMM operators needed for dequantization and requantization. To provide insights into non-GEMM optimization targets, we identify the most dominant non-GEMM operators for each model and deployment software. We also show that widely adopted optimizations such as operator fusion do not completely address the non-GEMM performance bottleneck: non-GEMM operators still account for 15% to 48% of total latency.
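
The Amdahl's law argument in the abstract can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the example inputs (a 90% GEMM share and a 10x GEMM speedup) are assumptions chosen for the arithmetic, not measurements from the paper. It shows how the non-GEMM share of latency grows once only the GEMM portion is accelerated.

```python
def non_gemm_share(gemm_fraction: float, gemm_speedup: float) -> float:
    """Share of end-to-end latency spent in non-GEMM operators after the
    GEMM portion alone is accelerated by `gemm_speedup` (Amdahl's law).

    `gemm_fraction` is the GEMM share of the original (unaccelerated) latency.
    Both names are hypothetical helpers for this illustration.
    """
    if not 0.0 <= gemm_fraction <= 1.0:
        raise ValueError("gemm_fraction must be in [0, 1]")
    if gemm_speedup <= 0.0:
        raise ValueError("gemm_speedup must be positive")
    non_gemm_time = 1.0 - gemm_fraction        # unchanged by GEMM acceleration
    gemm_time = gemm_fraction / gemm_speedup   # shrinks with the speedup
    return non_gemm_time / (non_gemm_time + gemm_time)


def overall_speedup(gemm_fraction: float, gemm_speedup: float) -> float:
    """End-to-end speedup when only the GEMM portion is accelerated."""
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)


if __name__ == "__main__":
    # Assumed example values, not taken from the paper's measurements.
    g, s = 0.9, 10.0
    print(f"non-GEMM share before: {1.0 - g:.1%}")
    print(f"non-GEMM share after:  {non_gemm_share(g, s):.1%}")
    print(f"end-to-end speedup:    {overall_speedup(g, s):.2f}x")
```

Under these assumed inputs, non-GEMM operators go from one tenth of the latency to roughly half of it, while the end-to-end speedup stays well below the 10x applied to GEMM.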

Paper Structure (31 sections, 9 figures, 5 tables)

Introduction
Background
ML Operators
ML Models and Popular Tasks
Performance Characterization Methodology
Models included in NonGEMM Bench
NonGEMM Bench Inputs
NonGEMM Bench Outputs
NonGEMM Bench Performance Characterization Flow
Case Studies
Non-GEMM Performance Characterization Results
The Impact of Deployment Flow on Non-GEMM Performance
The Impact of Quantization on Non-GEMM Performance
Key Observations and Insights
Related Works
...and 16 more sections

Figures (9)

Figure 1: The latency breakdown into GEMM and non-GEMM operators on AMD EPYC 7763 + NVIDIA A100 GPU. We measure the latency on two popular models from HuggingFace: (a) GPT2-XL (batch 1) [achiam2023gpt] and (b) Swin Transformer (batch 1) [liu2021swin].
Figure 2: Descriptions of example non-GEMM and GEMM operators. (a) (Non-GEMM) non-maximum suppression [he2017mask], (b) (GEMM) Conv1D, (c) (Non-GEMM) Layer Normalization [ba2016layernorm], and (d) (GEMM) Linear.
Figure 3: Architectures of three popular ML model families.
Figure 4: An overview of NonGEMM Bench flow.
Figure 5: End-to-End inference GPU energy consumption of models running on the Data Center (CPU + GPU) configuration.
...and 4 more figures
