Performance Characterization of Expert Router for Scalable LLM Inference

Josef Pichlmeier, Philipp Ross, Andre Luckow

TL;DR

The findings show that Expert Router introduces minimal latency overhead and that the configuration of the expert models is a dominant factor in performance outcomes, highlighting the potential of Expert Router for efficient and scalable LLM deployment.

Abstract

Large Language Models (LLMs) have experienced widespread adoption across scientific and industrial domains due to their versatility and utility for diverse tasks. Nevertheless, deploying and serving these models at scale with optimal throughput and latency remains a significant challenge, primarily because of LLMs' high computational and memory demands. Specialized models optimized for specific tasks can be combined through a routing mechanism to address these challenges, creating a modular inference system. This paper introduces Expert Router, a scalable routing architecture that directs prompts to specialized expert models. We characterize multiple Expert Router configurations, including different Llama 3 models with quantized and non-quantized weights, with up to 1,000 concurrent users. Our findings reveal that Expert Router introduces minimal latency overhead, with the configuration of expert models being a dominating factor in performance outcomes. High-parameter expert models deliver stable throughput and latency under moderate concurrency levels. In contrast, smaller expert models maintain competitive performance across a wider range of concurrent users compared to tensor-parallelized baseline models. This highlights the potential of Expert Router for efficient and scalable LLM deployment.

Paper Structure (17 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: Expert Router architecture and experimental setup: The architecture comprises three main components: the routing gateway, the Triton inference server, and the user simulator. Incoming prompts (1) are classified by the routing gateway using a k-means algorithm (2 + 3) and forwarded to the corresponding language model (4). Each model runs independently on its own GPU, so the models can process queries concurrently. The inference responses are returned to the respective users (5 + 6). The clustering algorithm has been trained on a set of samples, indicated by the dots, and new prompts are classified according to their topics, indicated by the colors.
  • Figure 2: Distribution of the number of input tokens. The plot shows the distribution of the number of input tokens in the test set. Its shape is based on the input length distribution used by MLPerf in the Llama 2 70B benchmark test [atta-fosu_llama_2024].
  • Figure 3: Preliminary Experiments: The two plots show the trajectories of the response times with increasing user concurrency for different batch sizes and data types. The results on the left use a Llama 3 70B model parallelized over 4 GPUs (TP=4), and those on the right use the same model parallelized over 8 GPUs (TP=8). Based on these results, we select the three baseline models listed in Table 1.
  • Figure 4: Median Time to First Token: Baseline models (A, B, C) show lower TTFT values than the 70B Expert Router model (D) because tensor parallelism benefits the prefill phase. The 8B Expert Router configuration (E) shows TTFT similar to that of the tensor-parallelized models due to its reduced computational demands. The whiskers on the right box plot show that the Expert Router models have higher minimum values, highlighting the extra latency from the routing gateway.
  • Figure 5: Median Time per Output Token: The 8B Expert Router configuration (E) maintains the lowest TPOT due to its smaller number of parameters. Baseline model C shows higher TPOT due to increased communication overhead with tensor parallelism.
  • ...and 2 more figures
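
The routing step described in the Figure 1 caption (classify a prompt with k-means, then forward it to the matching expert) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the prompt has already been turned into a numeric embedding and that the k-means centroids were trained beforehand. The centroid values, the expert names, and the functions `nearest_centroid` and `route` are all hypothetical.

```python
import math

# Hypothetical cluster centroids, as if produced by an earlier k-means training run.
# Real prompt embeddings would have far more than two dimensions.
CENTROIDS = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]

# Hypothetical mapping from cluster index to an expert model name (not from the paper).
EXPERTS = {0: "expert_code", 1: "expert_medical", 2: "expert_finance"}


def nearest_centroid(embedding, centroids):
    """Return the index of the centroid closest to `embedding` (Euclidean distance).

    This is the assignment step of k-means applied to a single new point.
    """
    best_idx, best_dist = -1, math.inf
    for idx, centroid in enumerate(centroids):
        dist = math.dist(embedding, centroid)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def route(embedding, centroids=CENTROIDS, experts=EXPERTS):
    """Pick the expert model responsible for the cluster the prompt falls into."""
    return experts[nearest_centroid(embedding, centroids)]


if __name__ == "__main__":
    # A prompt embedding near the second centroid is sent to the second expert.
    print(route((4.2, 4.9)))
```

In a deployment like the one in the caption, the returned name would select which model on the inference server receives the request; how the gateway embeds prompts and addresses the server is not specified in this excerpt.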