
Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry

Syed Eqbal Alam, Zhan Shu

Abstract

We develop algorithms for collaborative control of AI agents and critics in a multi-actor, multi-critic federated multi-agent system. Each AI agent and critic has access to classical machine learning or generative AI foundation models. The AI agents and critics collaborate with a central server to complete multimodal tasks such as fault detection, severity, and cause analysis in a network telemetry system, text-to-image generation, video generation, and healthcare diagnostics from medical images and patient records. The AI agents complete their tasks and send their outputs to AI critics for evaluation; the critics then send feedback to the agents to improve their responses. Collaboratively, they minimize the overall cost to the system without any inter-agent or inter-critic communication. AI agents and critics keep their cost functions, and the derivatives of those functions, private. Using multi-timescale stochastic approximation techniques, we provide convergence guarantees on the time-average active states of AI agents and critics. The communication overhead on the system is small, of the order of $\mathcal{O}(m)$ for $m$ modalities, and is independent of the number of AI agents and critics. Finally, we present an example of fault detection, severity, and cause analysis in network telemetry, together with a thorough evaluation of the algorithm's efficacy.
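The abstract names the ingredients (private cost functions, a central server, multi-timescale stochastic approximation, $\mathcal{O}(m)$ communication) but not the update rules. The Python sketch below shows one way these ingredients can fit together; the rule is an assumption chosen for illustration, not the authors' algorithm. Assumptions, all hypothetical: each unit (an AI agent or a critic, treated identically here) has a private quadratic cost $f_i(x) = a_i x^2/2 + b_i x$; at step $k$ it is active with probability $\sigma_i = \Omega_j \bar{x}_i / f_i'(\bar{x}_i)$, where $\Omega_j$ is the single scalar the server broadcasts for modality $j$; the server observes only aggregate activity per modality and moves $\Omega_j$ toward a made-up capacity $C_j$ with constant gain `tau` (fast timescale), while each unit's time-average active state $\bar{x}_i$ uses a decaying $1/(k+2)$ step (slow timescale). At a fixed point $\bar{x}_i = \sigma_i$, so $f_i'(\bar{x}_i) = \Omega_j$: marginal costs are equalized within each modality, which is the optimality condition for minimizing $\sum_i f_i$ subject to $\sum_{i \in j} \bar{x}_i = C_j$. The content-level exchange between agents and critics is not modeled.

```python
import numpy as np


def run_sketch(a, b, modality, capacity, steps=20_000, tau=0.01, seed=0):
    """Hypothetical two-timescale coordination loop (illustration only).

    Unit i (an AI agent or critic) has a private cost
    f_i(x) = a_i * x**2 / 2 + b_i * x, belongs to one modality, and never
    talks to other units. The server sees only per-modality activity and
    broadcasts one scalar omega_j per modality.
    Returns the time-average active states xbar and the final omega.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    modality = np.asarray(modality, dtype=int)
    capacity = np.asarray(capacity, dtype=float)
    m = capacity.size

    # Start each unit at an equal share C_j / N_j of its modality's capacity.
    counts = np.bincount(modality, minlength=m)
    xbar = (capacity / counts)[modality]
    omega = np.ones(m)  # the only broadcast: m scalars per step

    for k in range(steps):
        # Private local rule: uses only f_i'(xbar_i) and the broadcast omega_j.
        marginal = a * xbar + b
        sigma = np.clip(omega[modality] * xbar / marginal, 0.0, 1.0)
        x = (rng.random(xbar.size) < sigma).astype(float)  # 1 = active, 0 = idle

        # Slow timescale: running time average of the active state
        # (the initial share counts as one pseudo-sample).
        xbar += (x - xbar) / (k + 2)

        # Fast timescale: server compares per-modality load with capacity.
        load = np.bincount(modality, weights=x, minlength=m)
        omega = np.maximum(omega - tau * (load - capacity), 1e-6)

    return xbar, omega


if __name__ == "__main__":
    # Made-up example: 5 units in modality 0, 3 units in modality 1.
    a = [1.0, 1.5, 2.0, 2.5, 3.0, 1.0, 2.0, 3.0]
    b = [0.2] * 5 + [0.3] * 3
    modality = [0] * 5 + [1] * 3
    capacity = [2.0, 1.0]
    xbar, omega = run_sketch(a, b, modality, capacity)
    marginal = np.asarray(a) * xbar + np.asarray(b)
    print("time-average active states:", np.round(xbar, 3))
    print("marginal costs f_i'(xbar_i):", np.round(marginal, 3))
```

For these made-up costs, the constrained optimum has a common marginal cost of about 0.89 in modality 0 and about 0.85 in modality 1, so the printed marginal costs should cluster near those values. Because the server sends one number per modality, the per-step broadcast stays at $m$ scalars however many agents and critics join, and no unit ever needs another unit's cost or derivative.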

Paper Structure

This paper contains 4 sections, 8 equations, 4 figures, and 1 algorithm.

Figures (4)

  • Figure 1: Block diagram of the essential components of AI agent-critic interaction. "Models" denotes the foundation models, for example, large language models and large vision models.
  • Figure 2: Density plots of input data rate, output data rate, bandwidth, and bytes sent for the CSV file (all rows) of network telemetry dataset 4 of Putina2021.
  • Figure 3: Plots of input data rate, load interval, bandwidth, and output drops versus time for a few data points from the CSV dataset.
  • Figure 4: Federated multi-agent system's block diagram: AI agents and critics coordinate with a central server to complete multimodal tasks.