Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching

Sungmin Yun; Kwanhee Kyung; Juhwan Cho; Jaewan Choi; Jongmin Kim; Byeongho Kim; Sukhan Lee; Kyomin Sohn; Jung Ho Ahn

TL;DR

Duplex presents a unified device that mixes high-Op/B xPUs with low-Op/B Logic-PIM to accelerate large language models during continuous batching. By distributing MoE and grouped-query attention workloads across specialized processors and enabling expert and attention co-processing, it achieves significant throughput, latency, and energy improvements over GPU baselines and prior PIM approaches. The approach is validated through a cycle-accurate simulator with Mixtral, GLaM, Grok1, OPT, and Llama3 models, demonstrating advantages in modern MoE/GQA workloads. This work highlights the practicality of heterogeneous, memory-bandwidth-aware design for scalable LLM inference on a single device with shared memory.

Abstract

Large language models (LLMs) have emerged due to their capability to generate high-quality content across diverse contexts. To reduce their explosively increasing demands for computing resources, a mixture of experts (MoE) has been introduced. The MoE layer enables exploiting a huge number of parameters with less computation. Applying state-of-the-art continuous batching increases throughput; however, it leads to frequent DRAM access in the MoE and attention layers. We observe that conventional computing devices have limitations when processing the MoE and attention layers, which dominate the total execution time and exhibit low arithmetic intensity (Op/B). Processing MoE layers only with devices targeting low Op/B, such as processing-in-memory (PIM) architectures, is challenging because continuous batching makes the Op/B of the MoE layer fluctuate. To address these challenges, we propose Duplex, which combines an xPU tailored for high-Op/B operations and Logic-PIM for low-Op/B operations within a single device. Duplex selects the most suitable processor based on the Op/B of each layer within LLMs. Prior PIM architectures, which place processing units inside DRAM dies and only target extremely low-Op/B (under one) operations, are not efficient here, because the Op/B of the MoE layer is at least 1 and that of the attention layer is 4-8 for grouped query attention. Following recent trends, Logic-PIM adds more through-silicon vias (TSVs) to enable high-bandwidth communication between the DRAM die and the logic die, and places powerful processing units on the logic die, which is best suited for handling low-Op/B operations ranging from a few to a few dozen. To maximally utilize the xPU and Logic-PIM, we propose expert and attention co-processing.
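The Op/B figures quoted in the abstract can be reproduced with a back-of-the-envelope model. The sketch below is not taken from the paper: it assumes FP16 operands (2 bytes per element), counts one multiply-add as two operations, and considers only weight traffic for the expert FFN and only KV-cache traffic for decode-stage attention. The function names and the `threshold` value in `pick_processor` are hypothetical, chosen only to illustrate the idea of dispatching by Op/B.

```python
def moe_expert_opb(tokens_per_expert: int, bytes_per_elem: int = 2) -> float:
    """Approximate Op/B of one expert FFN GEMM when weight traffic dominates.

    For a (tokens x d_in) @ (d_in x d_out) GEMM:
      ops          = 2 * tokens * d_in * d_out
      weight bytes = bytes_per_elem * d_in * d_out
    so d_in * d_out cancels out of the ratio.
    """
    return 2 * tokens_per_expert / bytes_per_elem


def gqa_decode_opb(deg_grp: int, bytes_per_elem: int = 2) -> float:
    """Approximate Op/B of decode-stage grouped-query attention.

    Per cached token and KV head, deg_grp query heads reuse the same K and V:
      ops   = 2 * deg_grp * d_head (QK^T) + 2 * deg_grp * d_head (AV)
      bytes = bytes_per_elem * d_head (K) + bytes_per_elem * d_head (V)
    so d_head cancels out of the ratio.
    """
    return (2 * 2 * deg_grp) / (2 * bytes_per_elem)


def pick_processor(opb: float, threshold: float = 32.0) -> str:
    """Hypothetical dispatch rule: low Op/B goes to Logic-PIM, high Op/B to the xPU.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return "Logic-PIM" if opb <= threshold else "xPU"


if __name__ == "__main__":
    # One token routed to an expert: Op/B of about 1, the lower bound cited above.
    print(moe_expert_opb(1), pick_processor(moe_expert_opb(1)))
    # Group sizes of 4 and 8 give the 4-8 range cited for grouped query attention.
    print(gqa_decode_opb(4), gqa_decode_opb(8))
    # Many tokens hitting one expert raise its Op/B well above the attention range.
    print(moe_expert_opb(128), pick_processor(moe_expert_opb(128)))
```

Under these assumptions the expert FFN's Op/B scales with the number of tokens routed to that expert, which is why it fluctuates from stage to stage under continuous batching, while the attention Op/B stays pinned near the group size.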

Paper Structure (32 sections, 16 figures, 1 table)

This paper contains 32 sections, 16 figures, 1 table.

Introduction
Background
Structure of Large Language Models (LLMs)
Mixture of Experts and Grouped-Query Attention
LLM Inference with Continuous Batching
High Bandwidth Memory (HBM)
Computational Analysis
Computational Analysis of MoE and Attention Layers
Limitations of Heterogeneous Systems
Duplex: Devices for Efficient LLM Inference
Implementation of High Op/B Processors
Implementation of Low Op/B Processors
Microarchitecture of Logic-PIM
Duplex Architecture
Comparing Duplex with prior PIM architecture
...and 17 more sections

Figures (16)

Figure 1: LLM architecture and inference process in a gen stage with batched requests. Attention and FFN layers compose a conventional LLM, whereas GQA and MoE are used in place of these two layers, respectively.
Figure 2: (a) Baseline batching, which performs inference at the request level. (b) Continuous batching, which performs inference at the stage level. T2FT, TBT, and E2E latency values for request 2 are also detailed.
Figure 3: Model distribution methodology and operation flow of an LLM in a multi-node/multi-GPU system [icml-2022-DSMoE]. For non-expert weights, systems exploit tensor parallelism within a node and data parallelism across nodes. For expert FFNs, the system allocates each expert FFN to a different GPU.
Figure 4: (a) Execution time ratio of each operation in Mixtral [arxiv-2024-mixtral] and GLaM [icml-2022-GLaM], varying $L_{out}$ and batch size with $L_{in} = 2048$. Mixtral (GLaM) uses $deg_{grp} = 4$ ($1$) for the attention layer and 8 (64) experts in the MoE layer, with each token selecting the top-2 experts. (b) The roofline graph for each model on GPUs with varying batch sizes (32-128) when $L_{in} = 2048$ and $L_{out} = 1024$. Details of the systems are in the experimental setup section.
Figure 5: (a) The ratio of decoding-only stages to mixed stages in Mixtral on a GPU system. (b) The normalized latency of a heterogeneous system compared to a GPU system in Mixtral with a batch size of 32. The GPU system consists of four GPUs, while the heterogeneous system consists of two GPUs and two Logic-PIMs (details in the Duplex section). (c) The normalized throughput of the heterogeneous system over the GPU system in Mixtral with a batch size of 128.
...and 11 more figures
