Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine

Zuoning Zhang, Dhruv Parikh, Youning Zhang, Viktor Prasanna

TL;DR

This paper benchmarks how effectively the Cerebras Wafer Scale Engine (WSE) accelerates LLM training and inference, and examines the performance scalability of the WSE through a roofline model.

Abstract

Transformer-based Large Language Models (LLMs) have recently reached state-of-the-art performance in the Natural Language Processing (NLP) and Computer Vision (CV) domains. LLMs use the Multi-Headed Self-Attention (MHSA) mechanism to capture long-range global attention relationships among input words or image patches, drastically improving their performance over prior deep learning approaches. In this paper, we evaluate the performance of LLMs on the Cerebras Wafer Scale Engine (WSE). Cerebras WSE is a high-performance computing system with 2.6 trillion transistors, 850,000 cores, and 40 GB of on-chip memory. Its Sparse Linear Algebra Compute (SLAC) cores eliminate multiply-by-zero operations, and its 40 GB of on-chip memory is uniformly distributed among the SLAC cores, enabling fast local access to model parameters. Moreover, Cerebras software configures routing between cores at runtime, reducing communication overhead among cores. As LLMs become more widely used, new hardware architectures are needed to accelerate LLM training and inference. We benchmark the effectiveness of the WSE architecture at accelerating LLM training and inference. Additionally, we analyze whether Cerebras WSE can scale the memory wall associated with traditionally memory-bound compute tasks using its 20 PB/s high-bandwidth memory. Furthermore, we examine the performance scalability of Cerebras WSE through a roofline model. By plotting performance metrics against computational intensity, we aim to assess its effectiveness at handling highly compute-intensive LLM training and inference tasks.
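
The roofline analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard roofline bound, in which attainable performance is the smaller of the compute roof and the product of memory bandwidth and computational intensity. The 20 PB/s bandwidth is the figure quoted in the abstract; the peak compute value `ASSUMED_PEAK_FLOPS` and all function names are hypothetical placeholders chosen for illustration and are not taken from the paper.

```python
"""Roofline sketch: attainable performance vs. computational intensity."""


def attainable_flops(intensity, peak_flops, bandwidth):
    """Return the roofline bound in FLOP/s.

    intensity  -- computational intensity in FLOP per byte moved
    peak_flops -- compute roof in FLOP/s
    bandwidth  -- memory bandwidth in bytes/s
    """
    if intensity < 0 or peak_flops <= 0 or bandwidth <= 0:
        raise ValueError("intensity must be >= 0 and both roofs must be > 0")
    # Below the ridge point the memory roof (bandwidth * intensity) binds;
    # above it the compute roof binds.
    return min(peak_flops, bandwidth * intensity)


def ridge_point(peak_flops, bandwidth):
    """Intensity (FLOP/byte) at which the memory and compute roofs meet."""
    return peak_flops / bandwidth


def is_memory_bound(intensity, peak_flops, bandwidth):
    """True when a kernel at this intensity sits left of the ridge point."""
    return intensity < ridge_point(peak_flops, bandwidth)


WSE_BANDWIDTH = 20e15      # 20 PB/s, the figure quoted in the abstract
ASSUMED_PEAK_FLOPS = 1e15  # HYPOTHETICAL placeholder, not a value from the paper

if __name__ == "__main__":
    for intensity in (0.001, 0.01, 0.1, 1.0, 10.0):
        bound = attainable_flops(intensity, ASSUMED_PEAK_FLOPS, WSE_BANDWIDTH)
        bound_by_memory = is_memory_bound(
            intensity, ASSUMED_PEAK_FLOPS, WSE_BANDWIDTH
        )
        regime = "memory-bound" if bound_by_memory else "compute-bound"
        print(f"{intensity:>6} FLOP/B -> {bound:.3e} FLOP/s ({regime})")
```

With a very high memory bandwidth the ridge point moves toward low intensities, so more kernels land in the compute-bound regime. Where the ridge actually falls depends on the real peak compute figure, which this sketch does not supply.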

Paper Structure

This paper contains 17 sections, 4 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: GPT transformer decoder architecture. Input embeddings are first summed with positional embeddings and then fed into the GPT transformer-decoder block, where MHSA is performed.
  • Figure 2: BERT transformer encoder architecture. A [class] token embedding is prepended to the input sequence. The BERT architecture is similar to the GPT model architecture shown in Figure 1, except that MHSA occurs between the current token and the tokens to both its left and right. In BERT, the final output hidden state of the [class] token is used to predict the probability distribution for classification of the input sequence.
  • Figure 3: Cerebras WSE architecture. Cores are connected in a 2D mesh topology. Each core has its own compute logic and a dedicated router that connects it to neighboring cores. Each core also has 48 KB of SRAM, totaling 40 GB of on-chip SRAM across the entire chip.
  • Figure 4: BERT training throughput analysis. Training throughput is measured in samples/sec. Batch sizes are all powers of 2.
  • Figure 5: Training throughput analysis of GPT-3 models over varying batch sizes. 5(a) shows the training throughput for the GPT-3 2.7B, 6.7B, 13B, and 20B models. 5(b) shows the training throughput for the GPT-3 256M and 590M models. All batch sizes are measured in number of samples and are powers of 2.
  • ...and 3 more figures