Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations

Leo Donisch, Sigurd Schacht, Carsten Lanquillon

TL;DR

The paper surveys inference optimization techniques for large language models, focusing on reducing resource usage while maintaining task performance. It organizes methods into four categories (quantization, pruning, knowledge distillation, and architectural optimization) and discusses their practical implications, hardware dependencies, and deployment trade-offs. Key insights include the maturity of 8-bit quantization approaches and of the post-training quantization (PTQ) and quantization-aware training (QAT) paradigms, the difficulty of identifying prune-worthy structures in very large models, and the potential of architectural techniques such as Flash Attention and Speculative Decoding to deliver substantial speedups. The findings serve as a practical guide for choosing a mix of methods suited to the target hardware and latency requirements of a given deployment scenario.

Abstract

Large language models are ubiquitous in natural language processing because they can adapt to new tasks without retraining. However, their sheer scale and complexity present unique challenges and opportunities, prompting researchers and practitioners to explore novel model training, optimization, and deployment methods. This literature review focuses on various techniques for reducing resource requirements and compressing large language models, including quantization, pruning, knowledge distillation, and architectural optimizations. The primary objective is to explore each method in depth and highlight its unique challenges and practical applications. The discussed methods are categorized into a taxonomy that presents an overview of the optimization landscape and helps readers navigate it to better understand the research trajectory.

Paper Structure

This paper contains 11 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Taxonomy of optimization techniques
  • Figure 2: Different numeric representation formats used in machine learning [kharya2020]
  • Figure 3: Symmetric quantization example [symAsym_intel_pic]
  • Figure 4: Asymmetric quantization example [symAsym_intel_pic]
  • Figure 5: Example generation taken from [Leviathan2022FastIF]; green tokens are accepted generations, while red and blue tokens are rejections and corrections, respectively.
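
Figures 3 and 4 refer to symmetric and asymmetric quantization. As a rough illustration of the difference, the sketch below assumes the common 8-bit affine scheme (a scale factor plus an optional zero point); it is not taken from the paper, and the function names are hypothetical. Symmetric quantization fixes the zero point at 0 and uses a signed integer range, while asymmetric quantization shifts the range with a zero point so that skewed value distributions use all available integer levels.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers; the zero point is fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantize_asymmetric(values, num_bits=8):
    """Map floats to unsigned integers using a scale and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1  # 0..255 for 8 bits
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point=0):
    """Recover approximate floats from integer levels."""
    return [(x - zero_point) * scale for x in q]


# Illustrative input (made-up values, skewed toward the positive side).
weights = [-1.0, 0.0, 0.5, 2.0]

q_sym, s_sym = quantize_symmetric(weights)
q_asym, s_asym, zp = quantize_asymmetric(weights)

recon_sym = dequantize(q_sym, s_sym)
recon_asym = dequantize(q_asym, s_asym, zp)
```

For this skewed input, the symmetric variant leaves the negative half of its integer range mostly unused, whereas the asymmetric variant spreads the values over all 256 levels and therefore has a smaller step size (`s_asym < s_sym`).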