Enhancing reliability in AI inference services: An empirical study on real production incidents
Bhala Ranganathan, Mickey Zhang, Kai Wu
TL;DR
The paper addresses reliability challenges in hyperscale LLM inference through an empirical, practice-based study of 156 high-severity production incidents. It introduces a four-way taxonomy (infrastructure, model configuration, inference engine, operational), reports strong labeling consistency (Cohen's κ ≈ 0.89), and quantifies dominant failure modes and mitigation outcomes. About 74% of incidents were auto-detected and about 28% required a hotfix; many others were mitigated through traffic routing, node rebalancing, or capacity increase policies, which points to further automation opportunities such as AIOps and intelligent failover. The work also provides a practitioner-oriented adoption checklist and actionable guidance on monitoring, traffic routing, and deployment validation, aimed at more reliable and cost-efficient LLM serving at scale.
Abstract
Hyperscale large language model (LLM) inference places extraordinary demands on cloud systems, where even brief failures can translate into significant user and business impact. To better understand and mitigate these risks, we present one of the first provider-internal, practice-based analyses of LLM inference incidents. We developed a taxonomy and methodology grounded in a year of operational experience, validated it on 156 high-severity incidents, and conducted a focused quantitative study of April to June 2025 to ensure recency and relevance. Our approach achieves high labeling consistency (Cohen's κ ≈ 0.89), identifies dominant failure modes (in our dataset, ≈60% of incidents were inference engine failures, and ≈40% of those were timeouts), and surfaces mitigation levers (≈74% of incidents were auto-detected; ≈28% required a hotfix). Beyond hotfixes, many incidents were mitigated via traffic routing, node rebalancing, or capacity increase policies, indicating further automation opportunities. We also show how the taxonomy guided targeted strategies such as connection liveness, GPU capacity-aware routing, and per-endpoint isolation, which reduced incident impact and accelerated recovery. Finally, we contribute a practitioner-oriented adoption checklist that enables others to replicate our taxonomy and analysis and to identify automation opportunities in their own systems. This study demonstrates how systematic, empirically grounded analysis of inference operations can drive more reliable and cost-efficient LLM serving at scale.
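
For readers unfamiliar with the labeling-consistency metric cited above, the following is a minimal sketch of how Cohen's κ can be computed for two annotators who each assign one taxonomy category per incident. It uses the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name `cohens_kappa`, the short category names, and the toy label lists are illustrative assumptions; they are not the study's tooling or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label
    # frequencies, summed over categories.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for eight incidents (not data from the study).
annotator_1 = ["engine", "engine", "infra", "config",
               "ops", "engine", "infra", "engine"]
annotator_2 = ["engine", "engine", "infra", "config",
               "ops", "infra", "infra", "engine"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the annotators agree on 7 of 8 incidents (p_o = 0.875) and the chance agreement is p_e = 0.3125, so κ = 9/11, or about 0.82. A value near 0.89, as reported in the abstract, indicates agreement well beyond what the category frequencies alone would produce.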
