The Case for Co-Designing Model Architectures with Hardware

Quentin Anthony, Jacob Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, Aamir Shafi, Hari Subramoni, Dhabaleswar Panda

TL;DR

The paper addresses the problem that transformer throughput depends strongly on how well the model's shape fits the target hardware, a dependence that is often overlooked in DL model design. It takes a hardware-first approach: it maps transformer operations to their underlying GEMMs, analyzes GPU kernel behavior (tiling, wave quantization, and Tensor Core usage), and derives practical guidelines for maximizing throughput. The contributions include a performance map from individual GEMMs up to full transformer blocks, a set of actionable design rules, and empirical demonstrations of throughput gains of up to 39% while preserving accuracy, obtained via architectural tweaks such as parallel layers, FlashAttention, rotary embeddings, and SwiGLU-driven MLP scaling. The findings highlight the practical impact of hardware co-design for transformers, enabling more efficient training and inference across diverse GPU architectures and deployment scenarios.

Abstract

While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL model to be more amenable to the target hardware can significantly improve the runtime performance of DL training and inference. In this paper, we provide a set of guidelines for users to maximize the runtime performance of their transformer models. These guidelines have been created by carefully considering the impact of various model hyperparameters controlling model shape on the efficiency of the underlying computation kernels executed on the GPU. We find the throughput of models with efficient model shapes is up to 39% higher while preserving accuracy compared to models with a similar number of parameters but with unoptimized shapes.
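To make the tiling and wave quantization effects named in the TL;DR concrete, here is a minimal sketch of how the output of a GEMM might be split into tiles and scheduled in waves. It is an illustration under stated assumptions, not the paper's model: the function name `gemm_waves`, the 128x256 tile shape, and the one-tile-per-SM-per-wave scheduling are hypothetical simplifications, and 108 is the streaming multiprocessor count of an NVIDIA A100.

```python
import math


def gemm_waves(m, n, tile_m=128, tile_n=256, num_sms=108):
    """Estimate tiles, waves, and SM utilization for an (m x n) GEMM output.

    Assumptions (illustrative only): the kernel uses a fixed tile_m x tile_n
    output tile, and each SM processes exactly one tile per wave.
    """
    # Tile quantization: partial tiles at the edges still cost a full tile.
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    # Wave quantization: a partially filled last wave still takes a full wave.
    waves = math.ceil(tiles / num_sms)
    # Fraction of available SM slots that hold a tile across all waves.
    utilization = tiles / (waves * num_sms)
    return tiles, waves, utilization


# 9 x 12 = 108 tiles fills exactly one wave on 108 SMs.
print(gemm_waves(1152, 3072))
# One extra row of tiles (120 tiles) spills into a second, mostly idle wave.
print(gemm_waves(1280, 3072))
```

Under these assumptions, growing one dimension slightly past a wave boundary roughly doubles the estimated runtime while adding little useful work, which is the kind of shape sensitivity the guidelines are meant to avoid. Real kernels pick tile shapes dynamically, so the numbers here are only indicative.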

Paper Structure (31 sections, 3 equations, 47 figures, 3 tables)

Figures (47)

  • Figure 1: Transformer single-layer throughput of various architectures for a 2.7 billion parameter model (C1 and C2 are defined by this paper as C1: $h=2560, a=64$, C2: $h=2560, a=40$).
  • Figure 2: The proportion of latency from each transformer component for one layer of various model sizes.
  • Figure 3: GEMM tiling (GEMMguide).
  • Figure 4: The transformer architecture (radford2019language).
  • Figure 5: Throughput (in teraFLOP/s) for matrix multiplication computations of various sizes.
  • ...and 42 more figures
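
The C1 and C2 configurations in the Figure 1 caption share the same hidden size but differ in head count, so they differ in per-head dimension. The sketch below works out that arithmetic, assuming $h$ denotes the hidden size and $a$ the number of attention heads. The helper names `head_dim` and `largest_pow2_divisor` are hypothetical, and the power-of-two check is only a rough proxy for alignment; it does not by itself say which configuration is faster.

```python
def head_dim(hidden_size, num_heads):
    """Per-head dimension h / a; requires that a divides h evenly."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


def largest_pow2_divisor(x, cap=64):
    """Largest power of two (up to cap) that divides x."""
    d = 1
    while d < cap and x % (d * 2) == 0:
        d *= 2
    return d


# Configurations taken from the Figure 1 caption.
configs = {"C1": (2560, 64), "C2": (2560, 40)}
for name, (h, a) in configs.items():
    dim = head_dim(h, a)
    print(name, "head dim:", dim, "largest power-of-two divisor:", largest_pow2_divisor(dim))
```

With these inputs, C1 has a head dimension of 40 (divisible by 8 but not 16) and C2 has a head dimension of 64, even though both have the same hidden size.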