LLM-CoOpt: A Co-Design and Optimization Framework for Efficient LLM Inference on Heterogeneous Platforms

Jie Kong, Wei Wang, Jiehan Zhou, Chen Yu

TL;DR

LLM-CoOpt tackles the trio of bottlenecks in large language model inference on heterogeneous hardware: KV-cache memory bandwidth, redundant computation in multi-head attention, and long-context processing. It introduces three co-design techniques—Opt-KV for memory-efficient KV caching with FP8 compression, Opt-GQA for grouped-query attention to reduce computation, and Opt-Pa for optimized paged attention for long sequences. Across diverse LLaMa-GPTQ variants, the framework delivers up to $13.43\%$ throughput improvement and up to $16.79\%$ lower latency while maintaining accuracy, demonstrating practical benefits for real-world deployment on constrained hardware. The results validate the viability of an end-to-end algorithm–hardware co-design approach to accelerate LLM inference on heterogeneous platforms.
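
To make the memory argument behind Opt-KV and Opt-GQA concrete, the following back-of-the-envelope sketch estimates the per-token KV-cache footprint. It is an illustration under stated assumptions, not the paper's measurement: the model shape (40 layers, 40 heads, head dimension 128) is an assumed LLaMA-13B-like configuration, `KV_GROUPS = 8` is a hypothetical number of shared key-value heads, and the small overhead of FP8 scaling factors is ignored.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies: a key and a value vector per KV head, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed LLaMA-13B-like shape (not taken from the paper).
LAYERS, HEADS, HEAD_DIM = 40, 40, 128
KV_GROUPS = 8  # hypothetical number of shared KV heads for the grouped-query case

mha_fp16 = kv_cache_bytes_per_token(LAYERS, HEADS, HEAD_DIM, 2)      # baseline
mha_fp8 = kv_cache_bytes_per_token(LAYERS, HEADS, HEAD_DIM, 1)       # FP8 cache only
gqa_fp8 = kv_cache_bytes_per_token(LAYERS, KV_GROUPS, HEAD_DIM, 1)   # FP8 + grouped KV heads

for name, size in [("MHA FP16", mha_fp16), ("MHA FP8", mha_fp8), ("GQA FP8", gqa_fp8)]:
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

Under these assumptions, storing the cache in FP8 halves the bytes read and written per token, and sharing key-value heads across query groups shrinks it further in proportion to the ratio of KV heads to query heads.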

Abstract

Major challenges in LLM inference remain: frequent memory-bandwidth bottlenecks, computational redundancy, and inefficiencies in long-sequence processing. To address these issues, we propose LLM-CoOpt, a comprehensive algorithm-hardware co-design framework aimed at improving both throughput and latency in LLM inference. LLM-CoOpt integrates three key strategies: (1) Key-Value Cache Optimization, termed Opt-KV, which improves memory access efficiency by optimizing both the KV cache write and read paths, and introduces FP8 quantization to reduce memory footprint while maintaining accuracy; (2) Grouped-Query Attention for Computational Efficiency, termed Opt-GQA, which reduces overall computational complexity by restructuring multi-head self-attention into grouped-query attention with shared key-value projections, enabling higher throughput and lower resource consumption; (3) Paged Attention for Long-Sequence Processing, termed Opt-Pa, which adopts a two-step strategy that first segments long sequences into manageable chunks and then applies lazy memory mapping and computation, significantly reducing memory pressure and improving performance on long-context inputs. Experiments on the LLaMa-13B-GPTQ model demonstrate that LLM-CoOpt increases inference throughput by up to 13.43%, reduces latency by up to 16.79%, and maintains model accuracy. These results confirm that LLM-CoOpt provides a practical, high-performance optimization path for real-world inference of large-scale language models.
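
The chunk-then-map-lazily idea described for Opt-Pa can be sketched with a toy block table. This is a minimal illustration of generic paged KV-cache bookkeeping, not the paper's Opt-Pa implementation: the class `PagedKVCache`, its methods, and the block size of 16 are all hypothetical names and values chosen for the example.

```python
class PagedKVCache:
    """Toy block table: maps each sequence's logical blocks to physical blocks on demand."""

    def __init__(self, num_physical_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one new token, allocating a block only when one is needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size  # (physical block, offset in block)

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_physical_blocks=4, block_size=16)
slots = [cache.append_token("seq-a") for _ in range(20)]
print(len(cache.block_tables["seq-a"]), "blocks in use,", len(cache.free_blocks), "free")
```

Because physical blocks are claimed only when a sequence actually grows into them, a 20-token sequence here holds two 16-token blocks rather than a buffer sized for the maximum context length, which is the kind of memory-pressure reduction the abstract attributes to lazy mapping.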

Paper Structure

This paper contains 14 sections, 13 equations, 7 figures, 2 tables, 3 algorithms.

Figures (7)

  • Figure 1: Memory Bottlenecks in KV Cache
  • Figure 2: MHA: multi-head attention (based on 2).
  • Figure 3: Storage Fragmentation Issue
  • Figure 4: Opt-GQA: A Lightweight Grouped-Query Attention Design
  • Figure 5: Opt-Pa: Optimized paged-attention design
  • ...and 2 more figures