Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching

Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn

TL;DR

Duplex presents a unified device that mixes high-Op/B xPUs with low-Op/B Logic-PIM to accelerate large language models during continuous batching. By distributing MoE and grouped-query attention workloads across specialized processors and enabling expert and attention co-processing, it achieves significant throughput, latency, and energy improvements over GPU baselines and prior PIM approaches. The approach is validated through a cycle-accurate simulator with Mixtral, GLaM, Grok1, OPT, and Llama3 models, demonstrating advantages in modern MoE/GQA workloads. This work highlights the practicality of heterogeneous, memory-bandwidth-aware design for scalable LLM inference on a single device with shared memory.

Abstract

Large language models (LLMs) have emerged due to their capability to generate high-quality content across diverse contexts. To curb their rapidly growing demand for computing resources, the mixture of experts (MoE) architecture has emerged. An MoE layer lets a model exploit a huge number of parameters with less computation. Applying state-of-the-art continuous batching increases throughput; however, it leads to frequent DRAM accesses in the MoE and attention layers. We observe that conventional computing devices have limitations when processing the MoE and attention layers, which dominate the total execution time and exhibit low arithmetic intensity (Op/B). Processing MoE layers only with devices that target low Op/B, such as processing-in-memory (PIM) architectures, is challenging because continuous batching makes the Op/B of the MoE layer fluctuate. To address these challenges, we propose Duplex, which combines an xPU tailored for high-Op/B operations with Logic-PIM, which efficiently performs low-Op/B operations, within a single device. Duplex selects the most suitable processor based on the Op/B of each layer within an LLM. Because the Op/B of the MoE layer is at least 1 and that of the attention layer is 4-8 for grouped query attention, prior PIM architectures, which place processing units inside DRAM dies and target only extremely low-Op/B (under one) operations, are not efficient. Following recent trends, Logic-PIM adds more through-silicon vias (TSVs) to enable high-bandwidth communication between the DRAM dies and the logic die, and places powerful processing units on the logic die, which is best suited for low-Op/B operations ranging from a few to a few dozen. To maximally utilize the xPU and Logic-PIM, we propose expert and attention co-processing.
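
The Op/B-based selection described in the abstract can be illustrated with a simple roofline model. The following is a minimal sketch, not the paper's actual scheduling policy: the `Processor` class, the function names, and the peak-compute and bandwidth figures are hypothetical placeholders, and the Op/B estimates assume FP16 data (2 bytes per element), two operations per multiply-accumulate, and memory traffic dominated by expert weights or the KV cache. Under those assumptions, the Op/B of decode-stage grouped query attention is roughly the group size, and the Op/B of an expert FFN is roughly the number of tokens routed to that expert.

```python
from dataclasses import dataclass

BYTES_PER_ELEM = 2  # assumption: FP16 weights and KV cache
OPS_PER_MAC = 2     # one multiply plus one add


@dataclass(frozen=True)
class Processor:
    name: str
    peak_ops_per_s: float   # hypothetical peak compute throughput
    mem_bytes_per_s: float  # hypothetical memory bandwidth

    def attainable_ops_per_s(self, op_per_byte: float) -> float:
        # Roofline model: performance is capped either by peak compute
        # or by memory bandwidth multiplied by arithmetic intensity.
        return min(self.peak_ops_per_s, self.mem_bytes_per_s * op_per_byte)


def gqa_decode_op_per_byte(deg_grp: int) -> float:
    # Each cached K/V element is read once and reused by deg_grp query heads.
    return OPS_PER_MAC * deg_grp / BYTES_PER_ELEM


def expert_op_per_byte(tokens_routed: int) -> float:
    # Each expert weight is read once and reused by every token routed to it.
    return OPS_PER_MAC * tokens_routed / BYTES_PER_ELEM


def select_processor(op_per_byte: float, processors: list) -> Processor:
    # Pick the processor with the highest attainable throughput at this Op/B.
    return max(processors, key=lambda p: p.attainable_ops_per_s(op_per_byte))


# Placeholder hardware numbers, chosen only to illustrate the crossover.
xpu = Processor("xPU", peak_ops_per_s=1000e12, mem_bytes_per_s=3e12)
pim = Processor("Logic-PIM", peak_ops_per_s=100e12, mem_bytes_per_s=12e12)

for label, opb in [
    ("GQA attention, deg_grp=4", gqa_decode_op_per_byte(4)),
    ("expert FFN, 1 token routed", expert_op_per_byte(1)),
    ("expert FFN, 64 tokens routed", expert_op_per_byte(64)),
]:
    chosen = select_processor(opb, [xpu, pim])
    print(f"{label}: Op/B = {opb:.0f} -> {chosen.name}")
```

With these placeholder numbers, the low-Op/B attention and lightly loaded expert cases fall to the bandwidth-rich processor, while a heavily loaded expert moves to the compute-rich one. This shift is why the fluctuating Op/B under continuous batching calls for choosing a processor per layer rather than fixing the assignment.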

Paper Structure (32 sections, 16 figures, 1 table)

Figures (16)

  • Figure 1: LLM architecture and inference process in a gen stage with batched requests. Attention and FFN layers compose a conventional LLM, whereas GQA and MoE are used in place of these two layers, respectively.
  • Figure 2: (a) Baseline batching, which performs inference at the request level. (b) Continuous batching, which performs inference at the stage level. T2FT, TBT, and E2E latency values for request 2 are also detailed.
  • Figure 3: Model distribution methodology and operation flow of an LLM in a multi-node/multi-GPU system [icml-2022-DSMoE]. For non-expert weights, systems exploit tensor parallelism in the node, and data parallelism across nodes. For expert FFNs, the system allocates each expert FFN to a different GPU.
  • Figure 4: (a) Execution time ratio of each operation in Mixtral [arxiv-2024-mixtral] and GLaM [icml-2022-GLaM], varying $L_{out}$ and batch size while $L_{in} = 2048$. Mixtral (GLaM) uses $deg_{grp} = 4$ ($1$) for the attention layer and uses 8 (64) experts in the MoE layer, with each token selecting the top-2 experts. (b) The roofline graph for each model on GPUs with varying batch sizes (32--128) when $L_{in} = 2048$ and $L_{out} = 1024$. Details of systems are in Section \ref{sec:experimental_setup}.
  • Figure 5: (a) The ratio of decoding-only stage to mixed stage in Mixtral on a GPU system. (b) The normalized latency of a heterogeneous system compared to a GPU system in Mixtral with a batch size of 32. The GPU system consists of four GPUs, while the heterogeneous system consists of two GPUs and two Logic-PIMs (details in Section \ref{sec:duplex}). (c) The normalized throughput of the heterogeneous system over the GPU system in Mixtral with a batch size of 128.
  • ...and 11 more figures