SLIM: a Scalable Light-weight Root Cause Analysis for Imbalanced Data in Microservice

Rui Ren, Jingbang Yang, Linxiao Yang, Xinyue Gu, Liang Sun

TL;DR

This work tackles fault localization in rapidly changing microservice environments, where faults introduced by service changes are often minority events. It introduces SLIM, a scalable, interpretable classifier built on disjunctive rule sets that directly optimizes the $F_1$ score under cardinality constraints and supports online training with only about 15% training overhead compared to current state-of-the-art methods. The approach combines log-to-metric extraction, binary feature encoding, and a rule-learning pipeline that uses submodular optimization and minorize-maximization (MM) to generate compact, interpretable rules. The authors report higher accuracy and better interpretability than strong baselines on multiple benchmarks and a real-world cloud dataset, and show effective knowledge-base generation for RCA platforms. SLIM thus offers a practical, low-cost solution for timely root-cause analysis in imbalanced, evolving microservice systems.

Abstract

A newly deployed service, one kind of change service, can introduce a new type of minority fault. Existing state-of-the-art fault localization methods rarely consider imbalanced fault classification in change services. This paper proposes a novel method that uses decision rule sets to handle highly imbalanced data by optimizing the F1 score subject to cardinality constraints. The proposed method greedily generates the rule with maximal marginal gain and uses an efficient minorize-maximization (MM) approach to select rules iteratively, maximizing a non-monotone submodular lower bound. Compared with existing fault localization algorithms, our algorithm adapts to the imbalanced fault scenario of change services and provides interpretable fault causes that are easy to understand and verify. Our method can also be deployed in an online training setting, with only about 15% training overhead compared to current SOTA methods. Empirical studies show that our algorithm outperforms existing fault localization algorithms in both accuracy and model interpretability.
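To make the idea of a disjunctive rule set chosen for F1 under a cardinality constraint concrete, here is a minimal Python sketch. It is not the authors' algorithm: it uses a plain greedy search that adds the rule with the largest F1 improvement, and it does not implement the MM procedure or the submodular lower bound described above. The function names, the toy data, and the limits `max_rules` and `max_literals` are assumptions made for illustration.

```python
from itertools import combinations


def rule_covers(rule, x):
    """A rule is a conjunction: every feature index in it must be 1 in x."""
    return all(x[j] == 1 for j in rule)


def predict(rule_set, x):
    """A disjunctive rule set fires when at least one of its rules covers x."""
    return int(any(rule_covers(rule, x) for rule in rule_set))


def f1_score(rule_set, X, y):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    tp = fp = fn = 0
    for x, label in zip(X, y):
        pred = predict(rule_set, x)
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def greedy_rule_set(X, y, max_rules=2, max_literals=2):
    """Greedily add the rule with the largest F1 gain (illustrative only).

    max_rules bounds the number of rules and max_literals bounds the number
    of conditions per rule; both stand in for cardinality constraints.
    """
    n_features = len(X[0])
    candidates = [
        frozenset(combo)
        for k in range(1, max_literals + 1)
        for combo in combinations(range(n_features), k)
    ]
    chosen, best = [], 0.0
    while len(chosen) < max_rules:
        remaining = [r for r in candidates if r not in chosen]
        if not remaining:
            break
        rule = max(remaining, key=lambda r: f1_score(chosen + [r], X, y))
        score = f1_score(chosen + [rule], X, y)
        if score <= best:  # stop when no candidate improves F1
            break
        chosen.append(rule)
        best = score
    return chosen, best


# Toy imbalanced data (hypothetical): 2 faulty samples out of 8.
X = [
    [1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 0], [1, 0, 1], [0, 1, 1],
]
y = [1, 1, 0, 0, 0, 0, 0, 0]

rules, score = greedy_rule_set(X, y)
print(rules, score)
```

On this toy data, single-feature rules cover the two faulty samples but also raise false positives, so the conjunction of features 0 and 1 is preferred. Scoring by F1 rather than accuracy matters here because a classifier that never predicts a fault would still reach 75% accuracy.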

Paper Structure

This paper contains 32 sections, 1 theorem, 13 equations, 7 figures, 5 tables, and 1 algorithm.

Key Result

Lemma 1

Both $V^{1}_{\alpha}(r|r^{(t)})$ and $V^{2}_{\alpha}(r|r^{(t)})$ are non-monotone submodular functions.
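For readers unfamiliar with the terminology, the standard definitions are restated below. The ground set $\Omega$ and the generic set function $f$ are notation assumed here for illustration. This listing does not show how $V^{1}_{\alpha}$ and $V^{2}_{\alpha}$ are defined, so the block only recalls what the lemma's properties mean in general.

```latex
% Submodularity (diminishing returns): for a set function
% f : 2^{\Omega} \to \mathbb{R}, every A \subseteq B \subseteq \Omega,
% and every element e \in \Omega \setminus B,
\[
  f(A \cup \{e\}) - f(A) \;\ge\; f(B \cup \{e\}) - f(B).
\]
% Equivalent form: for all A, B \subseteq \Omega,
\[
  f(A) + f(B) \;\ge\; f(A \cup B) + f(A \cap B).
\]
% Non-monotone: f(A) \le f(B) is not guaranteed for A \subseteq B,
% so adding an element to a set can decrease the value of f.
```

In practice, non-monotonicity means that adding a further rule can lower the objective, so the selection step cannot rely on guarantees that hold only for monotone submodular functions.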

Figures (7)

  • Figure 1: Log Extraction Module: Log Parsing, Matching and Analyzing.
  • Figure 2: Example of the proposed rule selection strategy.
  • Figure 3: Framework of Rule Generation.
  • Figure 4: The overview of Knowledge Base generation.
  • Figure 5: The Overhead of all Algorithms on Benchmark Datasets.
  • ...and 2 more figures

Theorems & Definitions (3)

  • Definition 1: Rule
  • Definition 2: Rule Set
  • Lemma 1