ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput
Junsoo Kim, Hunjong Lee, Geonwoo Ko, Gyubin Choi, Seri Ham, Seongmin Hong, Joo-Young Kim
TL;DR
The paper addresses the mismatch between throughput-focused hardware and latency-sensitive QoS in LLM serving. It introduces ADOR, an automatic dataflow optimization and exploration framework that relies on a heterogeneous dataflow architecture template to balance latency and throughput. ADOR searches the design space across compute units, memory, NoC/P2P, and dynamic scheduling, backed by a simulator that predicts QoS metrics. Compared with GPUs and existing designs, ADOR delivers substantial gains, achieving up to $2.51\times$ better time-between-tokens (TBT) and $4.01\times$ higher area efficiency (e.g., LLaMA3 70B on 8 devices), and demonstrates robust real-world QoS under dynamic batching. The approach offers a scalable, cost-effective path for future LLM serving across diverse models and workloads.
Abstract
The growing adoption of Large Language Models (LLMs) across various domains has driven the demand for efficient and scalable AI-serving solutions. Deploying LLMs requires optimizations to manage their significant computational and data demands. The prefill stage processes large numbers of input tokens in parallel, increasing computational load, while the decoding stage relies heavily on memory bandwidth due to the auto-regressive nature of LLMs. Current hardware, such as GPUs, often fails to balance these demands, leading to inefficient utilization. While batching improves hardware efficiency, it delays response times, degrading Quality-of-Service (QoS). This disconnect between vendors, who aim to maximize resource efficiency, and users, who prioritize low latency, highlights the need for a better solution. To address this, we propose ADOR, a framework that automatically identifies and recommends hardware architectures tailored to LLM serving. By leveraging predefined architecture templates specialized for heterogeneous dataflows, ADOR optimally balances throughput and latency. It efficiently explores design spaces to suggest architectures that meet the requirements of both vendors and users. ADOR demonstrates substantial performance improvements, achieving $2.51\times$ higher QoS and $4.01\times$ better area efficiency compared to the A100 at high batch sizes, making it a robust solution for scalable and cost-effective LLM serving.
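The contrast the abstract draws between a compute-heavy prefill stage and a bandwidth-heavy decoding stage can be illustrated with a simple roofline-style estimate. The sketch below is not part of ADOR or its simulator: the function names, the 2-bytes-per-parameter (FP16) assumption, and the peak compute and bandwidth figures are all illustrative assumptions, and the model ignores KV-cache and activation traffic.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weight traffic for one forward pass.

    Assumption: each weight is read from memory once per pass and contributes
    about 2 FLOPs (one multiply, one add) for every token processed in that pass.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


def bound(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
          bytes_per_param: float = 2.0) -> str:
    """Classify a pass as compute-bound or memory-bound on a given device."""
    # Ridge point of the roofline: intensity at which compute and memory balance.
    ridge = peak_flops / mem_bandwidth
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if intensity >= ridge else "memory-bound"


if __name__ == "__main__":
    # Hypothetical device, not the specification of any real accelerator.
    PEAK_FLOPS = 300e12      # FLOP/s
    MEM_BANDWIDTH = 2e12     # bytes/s

    # Prefill: many prompt tokens share one pass over the weights.
    print("prefill, 2048 tokens:", bound(2048, PEAK_FLOPS, MEM_BANDWIDTH))
    # Decode: one new token per request, so tokens per pass equals the batch size.
    print("decode, batch 1:     ", bound(1, PEAK_FLOPS, MEM_BANDWIDTH))
    print("decode, batch 256:   ", bound(256, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions, single-request decoding sits far below the ridge point and larger batches move it toward the compute roof, which is the efficiency gain from batching that the abstract weighs against longer response times.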
