A Practical Framework for Flaky Failure Triage in Distributed Database Continuous Integration

Jun-Peng Zhu, Qizhi Wang, Yulong Zhai, Yishen Sun, Sen Chen, Kai Xu, Peng Cai, Hongming Zhang, Heng Long, Liu Tang, Qi Liu

Abstract

Flaky failure triage is crucial for keeping distributed database continuous integration (CI) efficient and reliable. After a failure is observed, operators must quickly decide whether to auto-rerun the job as likely flaky or escalate it as likely persistent, often under CPU-only millisecond budgets. Existing approaches remain difficult to deploy in this setting because they may rely on post-failure artifacts, produce poorly calibrated scores under telemetry and workload shifts, or learn from labels generated by finite rerun policies. To address these challenges, we present SCOUT, a practical state-aware causal online uncertainty-calibrated triage framework for distributed database CI. SCOUT uses only strict-causal features, including pre-failure telemetry and strictly historical data, to make online decisions without lookahead. Specifically, SCOUT combines lightweight state-aware scoring with optional sparse metadata fusion, applies post-hoc calibration to support fixed-threshold decisions across temporal and cross-domain shifts, and introduces a posterior-soft correction to reduce label bias induced by finite rerun budgets. We evaluated SCOUT on a benchmark of 3,680 labeled failed runs, 462 of which are flaky positives, described by 62 telemetry/context features. We further studied the feasibility of SCOUT on TiDB v7/v8 and a large GitHub Actions metadata-only trace. The experimental results demonstrated its effectiveness and usefulness. We deployed SCOUT in a production environment, achieving an end-to-end P95 latency of 1.17 ms on CPU.
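
The abstract does not spell out the form of the posterior-soft correction. As a rough illustration of why a finite rerun budget biases labels, the sketch below assumes a simple model that is not taken from the paper: a persistent failure fails every rerun, while a flaky failure passes each rerun independently with a fixed probability. Under that assumption, a run whose reruns all failed still carries some posterior probability of being flaky, which could serve as a soft label instead of a hard 0. The function name `soft_flaky_label` and its parameters are hypothetical.

```python
def soft_flaky_label(prior_flaky: float,
                     pass_rate_if_flaky: float,
                     failed_reruns: int) -> float:
    """Return P(flaky | every one of `failed_reruns` reruns failed).

    Assumed model (illustrative only, not the paper's definition):
    a persistent failure fails every rerun; a flaky failure passes each
    rerun independently with probability `pass_rate_if_flaky`.
    """
    if not 0.0 <= prior_flaky <= 1.0:
        raise ValueError("prior_flaky must be in [0, 1]")
    if not 0.0 <= pass_rate_if_flaky <= 1.0:
        raise ValueError("pass_rate_if_flaky must be in [0, 1]")
    if failed_reruns < 0:
        raise ValueError("failed_reruns must be non-negative")

    # Joint probability of "flaky" and "all reruns failed".
    flaky_mass = prior_flaky * (1.0 - pass_rate_if_flaky) ** failed_reruns
    # Joint probability of "persistent" and "all reruns failed".
    persistent_mass = 1.0 - prior_flaky

    total = flaky_mass + persistent_mass
    if total == 0.0:
        raise ValueError("observation has zero probability under this model")
    return flaky_mass / total


if __name__ == "__main__":
    # Hypothetical prior and pass rate; a larger budget shrinks the soft label.
    for budget in (0, 1, 3, 5):
        print(budget, round(soft_flaky_label(0.2, 0.5, budget), 4))
```

With a small rerun budget the posterior stays well above zero, which is the kind of residual uncertainty a hard "persistent" label would discard; the values of the prior and the per-rerun pass rate here are placeholders and would have to be estimated from historical data.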

Paper Structure (35 sections, 16 equations, 5 figures, 12 tables, 4 algorithms)

Figures (5)

  • Figure 1: The architecture of SCOUT.
  • Figure 2: Cost dominance map over $(c_{\text{fn}},c_{\text{auto}})$ with $c_{\text{fp}}{=}1$ on the temporal 60/20/20 split. Uncalibrated scores are never best; isotonic dominates in 71% of grid points and sigmoid in 29%.
  • Figure 3: OA-Cal sensitivity for decision portability. Left: label-scarce learning curve (OA-Cal normalized fixed-$\tau^*$ cost versus calibration-set size $n_{\text{cal}}$) for two transfer scenarios. Middle: controlled overlap sweep (mix target with source; annotate mix fraction; color = ESS/$n_{\text{cal}}$); importance weighting helps only under moderate overlap. Right: fixed-$\tau^*$ cost versus Cost-UCB $\kappa$ (bars) with an inset sweeping the AUC gate threshold.
  • Figure 4: CPU-only latency breakdown on an Apple M4. We separate model-only inference from the end-to-end serving path to avoid under-reporting deployment latency.
  • Figure 5: Top: global interpretability summary (state-only LR, strict-causal temporal split). Bottom: face-validity check via do-style perturbations on canonical state features (increase one standardized feature at a time and measure mean predicted $p(\mathrm{flaky})$).
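
The Figure 3 caption colors points by ESS/$n_{\text{cal}}$ without stating which effective-sample-size definition is used. A minimal sketch, assuming the common Kish definition $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$ for importance weights $w_i$ over the calibration set, is shown below; the function names `kish_ess` and `ess_ratio` are hypothetical.

```python
from typing import Sequence


def kish_ess(weights: Sequence[float]) -> float:
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    if any(w < 0 for w in weights):
        raise ValueError("importance weights must be non-negative")
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    if sum_sq == 0:
        raise ValueError("at least one weight must be positive")
    return total * total / sum_sq


def ess_ratio(weights: Sequence[float]) -> float:
    """ESS divided by the number of calibration points (ESS / n_cal)."""
    return kish_ess(weights) / len(weights)
```

Under this definition, uniform weights give a ratio of 1, while weights concentrated on a few calibration points push the ratio toward 0. That is consistent with the caption's remark that importance weighting helps only under moderate overlap, since a low ratio means few points effectively contribute to calibration.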