ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput

Junsoo Kim, Hunjong Lee, Geonwoo Ko, Gyubin Choi, Seri Ham, Seongmin Hong, Joo-Young Kim

TL;DR

The paper addresses the mismatch between throughput-oriented hardware and latency-sensitive Quality-of-Service (QoS) requirements in LLM serving. It introduces ADOR, an automatic dataflow optimization and exploration framework built on a heterogeneous dataflow architecture template that balances latency and throughput. ADOR searches the design space across compute units, memory, NoC/P2P interconnect, and dynamic scheduling, backed by a simulator that predicts QoS metrics. Compared with GPUs and existing designs, ADOR achieves up to $2.51\times$ better time-between-tokens (TBT) and $4.01\times$ higher area efficiency (e.g., LLaMA3 70B on 8 devices), and demonstrates robust real-world QoS under dynamic batching. The approach offers a scalable, cost-effective path for future LLM serving across diverse models and workloads.

Abstract

The growing adoption of Large Language Models (LLMs) across various domains has driven the demand for efficient and scalable AI-serving solutions. Deploying LLMs requires optimizations to manage their significant computational and data demands. The prefill stage processes large numbers of input tokens in parallel, increasing computational load, while the decoding stage relies heavily on memory bandwidth due to the auto-regressive nature of LLMs. Current hardware, such as GPUs, often fails to balance these demands, leading to inefficient utilization. While batching improves hardware efficiency, it delays response times, degrading Quality-of-Service (QoS). This disconnect between vendors, who aim to maximize resource efficiency, and users, who prioritize low latency, highlights the need for a better solution. To address this, we propose ADOR, a framework that automatically identifies and recommends hardware architectures tailored to LLM serving. By leveraging predefined architecture templates specialized for heterogeneous dataflows, ADOR optimally balances throughput and latency. It efficiently explores design spaces to suggest architectures that meet the requirements of both vendors and users. ADOR demonstrates substantial performance improvements, achieving 2.51x higher QoS and 4.01x better area efficiency compared to the A100 at high batch sizes, making it a robust solution for scalable and cost-effective LLM serving.
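The contrast the abstract draws between a compute-heavy prefill stage and a bandwidth-heavy decoding stage can be illustrated with a rough arithmetic-intensity estimate for a single linear layer. The sketch below is not taken from the paper: the function names, the 4096-wide layer, and the peak-compute and bandwidth figures are illustrative assumptions chosen only to show the roofline-style reasoning.

```python
def gemm_arithmetic_intensity(tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for one (tokens x d_in) @ (d_in x d_out) matmul.

    Assumes each operand is read or written exactly once (no cache reuse
    modeling) and 2 bytes per element (FP16/BF16).
    """
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per MAC
    bytes_moved = bytes_per_elem * (
        d_in * d_out        # weights
        + tokens * d_in     # input activations
        + tokens * d_out    # output activations
    )
    return flops / bytes_moved


def is_compute_bound(intensity, peak_flops, mem_bandwidth):
    """Roofline test: compute-bound if intensity reaches the machine balance."""
    return intensity >= peak_flops / mem_bandwidth


# Hypothetical accelerator (placeholder values, not a measured device):
PEAK_FLOPS = 300e12      # 300 TFLOP/s
MEM_BANDWIDTH = 2e12     # 2 TB/s

# Prefill: thousands of prompt tokens share one pass over the weights.
prefill = gemm_arithmetic_intensity(tokens=2048, d_in=4096, d_out=4096)
# Decode at batch size 1: a single token per pass over the same weights.
decode = gemm_arithmetic_intensity(tokens=1, d_in=4096, d_out=4096)

prefill_compute_bound = is_compute_bound(prefill, PEAK_FLOPS, MEM_BANDWIDTH)
decode_compute_bound = is_compute_bound(decode, PEAK_FLOPS, MEM_BANDWIDTH)
```

Under these assumptions the prefill matmul sits well above the machine balance while the single-token decode matmul sits far below it, which is one way to read the abstract's point that batching raises hardware efficiency at the cost of response time.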

Paper Structure

This paper contains 24 sections, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Gap between end-user and vendor needs in LLM Serving. ADOR explores and proposes hardware architectures that consider both throughput and latency.
  • Figure 2: (a) Architecture of Large Language Models. (b) Variations in TTFT and TBT types based on batching methods. (c) The structure and differences between Coarse-Grained Reconfigurable Architecture (CGRA) and Heterogeneous Dataflow Architecture (HDA).
  • Figure 3: (a) Proportion of key-value cache size for various models. As batch size increases, the proportion of key-value size grows. (b) Proportion of Attention operations in various LLM models.
  • Figure 4: (a) Average computational performance per unit area for various hardware during the prefill stage of LLaMA3 8B. (b) Actual memory bandwidth utilization for various GenAI models. Both GPU and TPU show less than 60% utilization compared to their specifications.
  • Figure 5: Architectural and dataflow differences of three micro-architectures: (a) systolic array, (b) MAC tree, and (c) vector unit.
  • ...and 13 more figures
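
The trend described in the caption of Figure 3(a), where the key-value cache takes a growing share of memory as batch size increases, can be sketched with a simple size estimate. This is a minimal sketch under stated assumptions, not the paper's methodology: the function names and the model configuration (32 layers, 8 key-value heads, head dimension 128, 8e9 parameters, FP16 storage, 4096-token sequences) are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes needed to store keys and values for every layer, token, and request."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def kv_fraction(num_params, layers, kv_heads, head_dim, seq_len, batch,
                bytes_per_elem=2):
    """Share of (weights + KV cache) memory taken by the KV cache."""
    kv = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem)
    weights = num_params * bytes_per_elem
    return kv / (kv + weights)


# Hypothetical model configuration (assumed for illustration only).
CONFIG = dict(num_params=8e9, layers=32, kv_heads=8, head_dim=128, seq_len=4096)

# The weights are fixed, while the cache scales linearly with batch size,
# so the cache's share rises monotonically.
fractions = {batch: kv_fraction(batch=batch, **CONFIG) for batch in (1, 8, 64)}
```

Because the weight footprint is constant and the cache term is linear in batch size, the computed share increases with every larger batch in this sketch, consistent with the direction of the trend the caption states.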