AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis

Pei Yang; Wanyi Chen; Asuka Yuxi Zheng; Xueqian Li; Xiang Li; Haoqin Tu; Jie Xiao; Yifan Pang; Dongdong Zhang; Fuqiang Li; Alfred Long; Bill Shi; Lynn Ai; Eric Yang

TL;DR

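This work presents AOI (Autonomous Operations Intelligence), a trainable multi-agent framework that formulates automated operations as a structured trajectory learning problem under security constraints. It integrates three components: a GRPO-trained diagnostic system, a read-write separated execution architecture, and a Failure Trajectory Closed-Loop Evolver.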

Abstract

Large language model (LLM) agents offer a promising data-driven approach to automating Site Reliability Engineering (SRE), yet their enterprise deployment is constrained by three challenges: restricted access to proprietary data, unsafe action execution in permission-governed environments, and the inability of closed systems to improve from failures. We present AOI (Autonomous Operations Intelligence), a trainable multi-agent framework that formulates automated operations as a structured trajectory learning problem under security constraints. Our approach integrates three key components. First, a trainable diagnostic system applies Group Relative Policy Optimization (GRPO) to distill expert-level knowledge into locally deployed open-source models, enabling preference-based learning without exposing sensitive data. Second, a read-write separated execution architecture decomposes operational trajectories into observation, reasoning, and action phases, allowing safe learning while preventing unauthorized state mutation. Third, a Failure Trajectory Closed-Loop Evolver mines unsuccessful trajectories and converts them into corrective supervision signals, enabling continual data augmentation. Evaluated on the AIOpsLab benchmark, our contributions yield cumulative gains. (1) The AOI runtime alone achieves 66.3% best@5 success on all 86 tasks, outperforming the prior state-of-the-art (41.9%) by 24.4 points. (2) With Observer GRPO training added, a locally deployed 14B model reaches 42.9% avg@1 on 63 held-out tasks with unseen fault types, surpassing Claude Sonnet 4.5. (3) The Evolver converts 37 failed trajectories into diagnostic guidance, improving end-to-end avg@5 by 4.8 points while reducing variance by 35%.
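
The abstract names GRPO as the training method but does not spell out its mechanics. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with GRPO: several responses are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` stabilizer, and the reward values are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: `rewards` holds one scalar reward per sampled
    response to the same prompt; `eps` avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for a group of four sampled responses (made-up values).
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.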

Paper Structure (61 sections, 5 equations, 15 figures, 18 tables, 2 algorithms)

Introduction
Related Work
AOI Runtime Architecture
Agent Components and Permissions
Runtime Pipeline
Execution Pipeline
Dual-Timescale Memory
Observer Step-Level Policy Optimization
GRPO Formulation
Multi-Dimensional Reward Function
Trajectory Evolver
Problem Formulation
Seeds: Definition and Data Source
GRPO-Optimized Trajectory Correction
Integration with AOI system
...and 46 more sections

Figures (15)

Figure 1: AOI System Overview. Left: Closed-Loop Evolution Pipeline. A Judge classifies SRE troubleshooting workflows by outcome. Failed workflows are repaired by the Evolver into corrected command sequences that serve as diagnostic guidance at inference time. Successful workflows are distilled by the Purifier into optimal diagnostic paths that serve as training data. Right: Multi-Agent Runtime. The Observer coordinates read-only diagnosis (Probe) and write-gated remediation (Executor). The Observer is trained via GRPO, and at inference time receives the Evolver's corrected plans as structured prompts.
Figure 2: AOI Runtime Agent Architecture. The Observer coordinates diagnosis through the Probe (read-only) and Executor (write-gated) agents. The Compressor maintains context efficiency via dual-timescale memory.
Figure 3: Performance across evaluation dimensions. Seeds-Failed, Base (untrained Qwen3-14B), and GRPO-trained Evolver compared on four reward dimensions and overall score.
Figure 4: Cumulative distribution of overall scores. Dashed lines indicate group means.
Figure 5: Trajectory-Corrective Evolver Architecture. The Evolver observes failed command sequences from the Observer's execution history, generates $G$ candidate corrections via GRPO sampling (sampling from a policy trained on successful trajectories), and provides the highest-scoring correction as a structured prompt to guide the Observer's next attempt. Key insight: This closed-loop mechanism converts failed diagnostic attempts into learning opportunities without requiring manual expert intervention.
...and 10 more figures
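
The Figure 5 caption describes the Evolver generating $G$ candidate corrections and passing the highest-scoring one to the Observer as a structured prompt. A minimal sketch of that selection step follows, under stated assumptions: the `Candidate` type, the scalar `score` field, the prompt wording, and the placeholder step strings are all hypothetical and are not the paper's actual data structures or reward function.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for one candidate correction."""
    commands: list[str]  # corrected command sequence (placeholder strings here)
    score: float         # assumed scalar score assigned to this candidate


def select_correction(candidates):
    """Return the highest-scoring candidate among the G sampled corrections."""
    if not candidates:
        raise ValueError("at least one candidate correction is required")
    return max(candidates, key=lambda c: c.score)


def to_structured_prompt(best):
    """Render the chosen correction as a numbered plan (assumed prompt format)."""
    steps = "\n".join(f"{i}. {cmd}" for i, cmd in enumerate(best.commands, start=1))
    return "Suggested diagnostic plan from a corrected failed attempt:\n" + steps


# G = 3 made-up candidates with placeholder steps.
candidates = [
    Candidate(commands=["step A", "step B"], score=0.41),
    Candidate(commands=["step A", "step C", "step D"], score=0.77),
    Candidate(commands=["step E"], score=0.12),
]
best = select_correction(candidates)
prompt = to_structured_prompt(best)
```

Keeping selection separate from prompt rendering means the scoring rule can change without touching the format the Observer receives.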
