AgentFM: Role-Aware Failure Management for Distributed Databases with LLM-Driven Multi-Agents

Lingzhe Zhang, Yunpeng Zhai, Tong Jia, Xiaosong Huang, Chiming Duan, Ying Li

TL;DR

The paper addresses failure management in distributed databases, showing that different roles (system, data, task) affect failure detection and remediation differently. It proposes AgentFM, a role-aware framework built on an LLM-driven multi-agent system orchestrated by a meta-agent, which uses traces, metrics, and logs to guide detection, diagnosis, and mitigation via a RAG+CoT (retrieval-augmented generation plus chain-of-thought) paradigm. A preliminary empirical study and a deployment on Apache IoTDB demonstrate the framework's feasibility, with strong anomaly detection and diagnosis performance and actionable mitigation guidance. These findings suggest that role-aware, multi-agent, LLM-driven approaches can improve real-world reliability and remediation in large-scale distributed data systems.

Abstract

Distributed databases are critical infrastructures for today's large-scale software systems, making effective failure management essential to ensure software availability. However, existing approaches often overlook the role distinctions within distributed databases and rely on small-scale models with limited generalization capabilities. In this paper, we conduct a preliminary empirical study to emphasize the unique significance of different roles. Building on this insight, we propose AgentFM, a role-aware failure management framework for distributed databases powered by LLM-driven multi-agents. AgentFM addresses failure management by considering system roles, data roles, and task roles, with a meta-agent orchestrating these components. Preliminary evaluations using Apache IoTDB demonstrate the effectiveness of AgentFM and open new directions for further research.

Paper Structure

This paper contains 12 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: AgentFM Architecture
  • Figure 2: System Agents Adaptation Workflow
  • Figure 3: Sample Mitigation Solutions from AgentFM