AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

Ilias Bournias, Lukas Cavigelli, Georgios Zacharopoulos

TL;DR

Simulated evaluations show that AcceLLM, a method inspired by cache data management, outperforms state-of-the-art systems by up to 30% in latency and efficiency across diverse workloads.

Abstract

Large Language Model (LLM) inference on large-scale systems is expected to dominate future cloud infrastructures. Efficient LLM inference in cloud environments with numerous AI accelerators is challenging and requires extensive optimization. Some current systems batch prefill and decoding together to boost throughput but suffer from latency issues, while others disaggregate the two phases, which leads to resource underutilization. We propose AcceLLM, a novel method that addresses latency and load balancing and is inspired by cache data management. It strategically uses redundant data to improve inference through load balancing and better hardware utilization. Simulated evaluations on Nvidia H100 GPU and Huawei Ascend 910B2 show that AcceLLM outperforms state-of-the-art systems by up to 30% in latency and efficiency, handling diverse workloads effectively.

Paper Structure

This paper contains 31 sections, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Peak Latency Challenge. Case A: Batched Prefill and Decoding resulting in a latency peak. Case B: Decoding data transfer overhead between instances due to a large prefill request.
  • Figure 2: The KV cache of different queries is allocated on different instances. Query F (KV cache on Instance 3) completes in time step $t$, so in step $t+1$ (and for many steps after) device 3 would otherwise be idle. We use the free space on the devices to keep redundant copies of the KV cache. After query F completes, every device still holds enough KV cache to maintain good load balancing.
  • Figure 3: Prefill-phase execution time and throughput.
  • Figure 4: Decoding-phase execution time and throughput.
  • Figure 5: Left: Batching prefill together with the decoding phase increases token generation latency by over 300$\%$. Right: Imbalance arises when batching 40 requests on one instance, which increases token generation latency by 7.2 ms compared to executing the same requests in parallel across two instances with a batch size of 20 each.
  • ...and 11 more figures
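
The redundancy idea described in the caption of Figure 2 can be illustrated with a small toy model. The sketch below is not the authors' scheduler: the function name `assign_queries`, the greedy least-loaded policy, and the example placement of queries A to D are all assumptions made for illustration. It only shows why keeping redundant KV cache copies lets an otherwise idle instance pick up decoding work after a query finishes.

```python
def assign_queries(copies, num_instances):
    """Toy greedy balancer (hypothetical, not AcceLLM's actual policy).

    copies maps each active query to the list of instances that hold a
    copy of its KV cache. A query can only be decoded on such an instance.
    Returns (assignment, load), where load[i] is the number of queries
    assigned to instance i.
    """
    load = [0] * num_instances
    assignment = {}
    # Place the least flexible queries (fewest copies) first, so that
    # queries with redundant copies can fill the remaining gaps.
    for query in sorted(copies, key=lambda q: (len(copies[q]), q)):
        target = min(copies[query], key=lambda i: (load[i], i))
        assignment[query] = target
        load[target] += 1
    return assignment, load


# Assumed state after query F (which lived on instance 2) has completed.
# Without redundancy, every KV cache exists on exactly one instance.
no_redundancy = {"A": [0], "B": [0], "C": [1], "D": [1]}
# With redundancy, free space on instance 2 holds copies of B and D.
with_redundancy = {"A": [0], "B": [0, 2], "C": [1], "D": [1, 2]}

plan_without, loads_without = assign_queries(no_redundancy, 3)
plan_with, loads_with = assign_queries(with_redundancy, 3)

print(loads_without)  # [2, 2, 0] -> instance 2 is idle
print(loads_with)     # [1, 2, 1] -> no instance is idle
```

In this toy setting the redundant copy of B lets instance 2 take over a query without any KV cache transfer at rebalancing time. The cost of creating and updating the copies is ignored here.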