
HW-GPT-Bench: Hardware-Aware Architecture Benchmark for Language Models

Rhea Sanjay Sukthanker, Arber Zela, Benedikt Staffler, Aaron Klein, Lennart Purucker, Joerg K. H. Franke, Frank Hutter

TL;DR

HW-GPT-Bench is a hardware-aware benchmark that uses surrogate predictions to approximate various hardware metrics across 13 devices for architectures in the GPT-2 family, with architectures containing up to 1.55B parameters.

Abstract

The increasing size of language models necessitates a thorough analysis across multiple dimensions to assess trade-offs among crucial metrics such as latency, energy consumption, GPU memory usage, and performance. Identifying optimal model configurations under specific hardware constraints is becoming essential but remains challenging because exhaustive training and evaluation on multiple devices is computationally expensive. To address this, we introduce HW-GPT-Bench, a hardware-aware benchmark that uses surrogate predictions to approximate various hardware metrics across 13 devices for architectures in the GPT-2 family, with architectures containing up to 1.55B parameters. Our surrogates, via calibrated predictions and reliable uncertainty estimates, faithfully model the heteroscedastic noise inherent in the energy and latency measurements. To estimate perplexity, we employ weight-sharing techniques from Neural Architecture Search (NAS), inheriting pretrained weights from the largest GPT-2 model. Finally, we demonstrate the utility of HW-GPT-Bench by simulating optimization trajectories of various multi-objective optimization algorithms in just a few seconds.
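
To make the idea of querying a noisy surrogate and comparing architectures on several objectives concrete, the following is a minimal sketch, not the benchmark's actual API. The function names (`toy_latency_surrogate`, `toy_perplexity`, `pareto_mask`), the noise model, and all constants are hypothetical and chosen only for illustration. It samples latency from a toy heteroscedastic model (noise standard deviation grows with the mean) and extracts the Pareto-optimal architectures when both latency and perplexity are minimized.

```python
import numpy as np


def pareto_mask(costs: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows; every column is minimized."""
    mask = np.ones(costs.shape[0], dtype=bool)
    for i in range(costs.shape[0]):
        # Row j dominates row i if it is no worse everywhere
        # and strictly better in at least one objective.
        no_worse = np.all(costs <= costs[i], axis=1)
        better = np.any(costs < costs[i], axis=1)
        mask[i] = not np.any(no_worse & better)
    return mask


def toy_latency_surrogate(embed_dim, n_layers, rng):
    """Hypothetical surrogate: returns one noisy latency sample.

    Assumption for illustration only: mean latency scales with
    embed_dim * n_layers and the noise std is 5% of the mean.
    """
    mean = 1e-3 * embed_dim * n_layers
    return rng.normal(mean, 0.05 * mean)


def toy_perplexity(embed_dim, n_layers):
    """Hypothetical perplexity proxy that improves with model size."""
    return 20.0 + 5000.0 / (embed_dim * n_layers) ** 0.5


rng = np.random.default_rng(0)
# A tiny made-up search space of (embedding dimension, number of layers).
archs = [(e, l) for e in (256, 512, 1024) for l in (6, 12, 24)]
costs = np.array(
    [[toy_latency_surrogate(e, l, rng), toy_perplexity(e, l)] for e, l in archs]
)
front = [a for a, keep in zip(archs, pareto_mask(costs)) if keep]
```

Because each latency query is a random draw, repeated runs with different seeds can yield different Pareto fronts, which is why the trade-offs are reported with best, worst, and randomly sampled observations.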

Paper Structure (50 sections, 13 equations, 72 figures, 11 tables)


Figures (72)

  • Figure 1: HW-GPT-Bench Overview. Illustration of the search space (left), hardware devices and metrics (middle) and multi-objective algorithms (right) used in the HW-GPT-Bench framework.
  • Figure 2: Empirical cumulative distribution for different subspaces of the search space.
  • Figure 3: Trade-offs between Energy, Latency, and Perplexity across architectures for different search spaces. The blue curve represents the Pareto front obtained by randomly sampling an observation, while the best and worst possible Pareto fronts (red and black markers, respectively) are obtained by using the best and worst measured values of latency and energy.
  • Figure 4: Calibration area, Prediction Intervals and Confidence Bounds for different surrogate types on Xeon Silver CPU (Latency) and V100 (Energy). The rightmost plots show only the predictions and confidence bands of AutoGluon.
  • Figure 5: Feature ranking of architecture dimensions at different scales (lower rank is better). The embedding dimension (in red) is the most important across scales, and the number of layers (in yellow) is more important at smaller scales. The MLP ratio (mr) and number of heads (nh) at layer $N-1$ are important across different scales (depicted in blue and green).
  • ...and 67 more figures