Kunlun Anomaly Troubleshooter: Enabling Kernel-Level Anomaly Detection and Causal Reasoning for Large Model Distributed Inference

Yuyang Liu, Jingjing Cai, Jiayi Ren, Peng Zhou, Danyang Zhang, Yin Du, Shijian Li

TL;DR

This work presents Kunlun Anomaly Troubleshooter (KAT), a two-module framework for troubleshooting anomalies in large model distributed inference (LMDI). The first module, Outpost, uses kernel-level function trace data to detect anomalies at nanosecond resolution; its Trace Tree Anomaly Detection (TTAD) method exploits inter- and intra-worker synchronicity. The second module, Analyzer, is a domain-adapted LLM that provides interpretable causal reasoning and root-cause analysis. The authors also introduce LMDIA, a full-stack multimodal dataset that captures trace events, performance metrics, and logs across 42 LM inference tasks and 11 anomaly scenarios. In a production evaluation on Alibaba Cloud, KAT achieves a precision of 0.884 and a recall of 0.936 in anomaly detection, and its causal reasoning is competitive with, and often better than, state-of-the-art models.

Abstract

Anomaly troubleshooting for large model distributed inference (LMDI) remains a critical challenge. Resolving anomalies such as inference performance degradation or latency jitter in distributed systems demands significant manual effort from domain experts, resulting in extremely time-consuming diagnosis with relatively low accuracy. In this paper, we introduce Kunlun Anomaly Troubleshooter (KAT), the first anomaly troubleshooting framework tailored for LMDI. KAT addresses this problem through two core innovations. First, KAT exploits the synchronicity and consistency of GPU workers and leverages function trace data to precisely detect kernel-level anomalies and the associated hardware components at nanosecond resolution. Second, KAT integrates these detection results into a domain-adapted LLM, delivering systematic causal reasoning and natural language interpretation of complex anomaly symptoms. Evaluations conducted in the Alibaba Cloud Service production environment indicate that KAT achieves over 0.884 precision and 0.936 recall in anomaly detection, providing detailed anomaly insights that significantly narrow the diagnostic scope and improve both the efficiency and the success rate of troubleshooting.
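
To make the synchronicity idea concrete, the sketch below flags a kernel on one worker whose duration departs sharply from the same kernel on its peer workers. It is only an illustration under stated assumptions, not the paper's TTAD algorithm: the function name `flag_straggler_kernels`, the input layout (kernel name mapped to per-worker durations in nanoseconds), the median/MAD score, and the threshold of 3.5 are all hypothetical choices made here.

```python
from statistics import median


def flag_straggler_kernels(durations_ns, threshold=3.5):
    """Flag (kernel, worker, duration) entries that are much slower than peers.

    durations_ns: hypothetical layout, {kernel_name: {worker_id: duration_ns}},
    holding the duration of the same kernel call on each synchronized worker.
    The score is a robust z-score based on the cross-worker median and the
    median absolute deviation (MAD). This is an assumed stand-in, not TTAD.
    """
    anomalies = []
    for kernel, per_worker in durations_ns.items():
        values = list(per_worker.values())
        med = median(values)
        mad = median(abs(v - med) for v in values)
        # Avoid division by zero when all workers report near-identical times.
        scale = mad if mad > 0 else max(1.0, 0.01 * med)
        for worker, duration in per_worker.items():
            score = 0.6745 * (duration - med) / scale
            if score > threshold:  # only slow outliers are reported
                anomalies.append((kernel, worker, duration))
    return anomalies


# Made-up durations for four workers running the same two kernels.
trace = {
    "allreduce": {0: 1000, 1: 1010, 2: 990, 3: 5000},
    "gemm": {0: 2000, 1: 2005, 2: 1995, 3: 2002},
}
print(flag_straggler_kernels(trace))
```

In this made-up input, only the `allreduce` call on worker 3 stands out from its peers, so it is the only entry reported. A per-kernel comparison like this ignores the call hierarchy that a trace tree would capture, which is one reason it should be read as a simplification.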

Paper Structure

This paper contains 20 sections, 7 figures, 1 table, 2 algorithms.

Figures (7)

  • Figure 1: Overview of Kunlun Anomaly Troubleshooter
  • Figure 2: Example of Trace Event
  • Figure 3: Trace Timelines of a Two Parallel GPUs Inference
  • Figure 4: LMDIA Data Monitoring Structure
  • Figure 5: Pipeline of KAT Outpost Anomaly Detection on Trace Data
  • ...and 2 more figures