A First Look at Bugs in LLM Inference Engines

Mugeng Liu, Siqi Zhong, Weichen Bi, Yixuan Zhang, Zhiyang Chen, Zhenpeng Chen, Xuanzhe Liu, Yun Ma

TL;DR

This paper addresses the gap in understanding bugs in LLM inference engines by performing the first large-scale empirical study across five popular engines, yielding a taxonomy of 28 root causes and six symptom types from 929 real-world bugs. It combines open coding with reliability analysis and fix-strategy mapping to reveal consistent bug patterns across deployments, phases, and backends, and to quantify repair effort and temporal trends. The work shows that non-crash symptoms and environment/configuration/resource root causes dominate maintenance effort, and it offers concrete diagnostic patterns, cross-engine implications, and practical guidelines for developers, vendors, and researchers. The findings enable more robust testing, targeted debugging, and better tooling for LLM inference engines, with public data to fuel future research and tooling advances.

Abstract

Large language model-specific inference engines (LLM inference engines for short) have become a fundamental component of modern AI infrastructure, enabling the deployment of LLM-powered applications (LLM apps) across cloud and local devices. Despite their critical role, LLM inference engines are prone to bugs due to the immense resource demands of LLMs and the complexities of cross-platform compatibility. However, a systematic understanding of these bugs remains lacking. To bridge this gap, we present the first empirical study on bugs in LLM inference engines. We mine the official repositories of five widely adopted LLM inference engines, constructing a comprehensive dataset of 929 real-world bugs. Through a rigorous open coding process, we analyze these bugs to uncover their symptoms, root causes, commonality, fix effort, fix strategies, and temporal evolution. Our findings reveal six bug symptom types and a taxonomy of 28 root causes, shedding light on the key challenges in bug detection and localization within LLM inference engines. Based on these insights, we propose a series of actionable implications for researchers, inference engine vendors, and LLM app developers, along with general guidelines for developing LLM inference engines.

Paper Structure

This paper contains 47 sections, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Workflow of LLM deployment.
  • Figure 2: Architecture of LLM inference engine.
  • Figure 3: Overview of methodology.
  • Figure 4: Distribution of unexpected factors.
  • Figure 5: Taxonomy of root causes of bugs in LLM inference engines. The circled numbers at the top left indicate bug counts.
  • ...and 4 more figures