LumiMAS: A Comprehensive Framework for Real-Time Monitoring and Enhanced Observability in Multi-Agent Systems

Ron Solomon, Yarin Yerushalmi Levi, Lior Vaknin, Eran Aizikovich, Amit Baras, Etai Ohana, Amit Giloni, Shamik Bose, Chiara Picardi, Yuval Elovici, Asaf Shabtai

TL;DR

This work proposes LumiMAS, a novel observability framework for multi-agent systems (MASs) that incorporates advanced analytics and monitoring techniques, and demonstrates its effectiveness in failure detection, classification, and root cause analysis (RCA).

Abstract

The incorporation of large language models (LLMs) in multi-agent systems (MASs) has the potential to significantly improve our ability to autonomously solve complex problems. However, such systems introduce unique challenges in monitoring, interpreting, and detecting system failures. Most existing MAS observability frameworks focus on analyzing each individual agent separately, overlooking failures associated with the entire MAS. To bridge this gap, we propose LumiMAS, a novel MAS observability framework that incorporates advanced analytics and monitoring techniques. The proposed framework consists of three key components: a monitoring and logging layer, an anomaly detection layer, and an anomaly explanation layer. LumiMAS's first layer monitors MAS executions, creating detailed logs of the agents' activity. These logs serve as input to the anomaly detection layer, which detects anomalies across the MAS workflow in real time. Then, the anomaly explanation layer performs classification and root cause analysis (RCA) of the detected anomalies. LumiMAS was evaluated on seven different MAS applications, implemented using two popular MAS platforms, and on a diverse set of possible failures. The applications include two novel failure-tailored applications that illustrate the effects of a hallucination or bias on the MAS. The evaluation results demonstrate LumiMAS's effectiveness in failure detection, classification, and RCA.
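
The three-layer flow described in the abstract (log, detect, explain) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`AgentEvent`, `RunLog`, `ZScoreDetector`, `explain`) is hypothetical, the logged features (duration and token count) are assumed for illustration, and a simple z-score over total token use stands in for the autoencoder-based detectors the paper uses. The explanation step is likewise a crude placeholder for the paper's classification and RCA.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class AgentEvent:
    """One logged agent step (hypothetical schema, assumed for illustration)."""
    agent: str
    duration_s: float
    tokens: int


@dataclass
class RunLog:
    """Layer 1 stand-in: collects the events of a single MAS execution."""
    run_id: str
    events: list = field(default_factory=list)

    def record(self, agent, duration_s, tokens):
        self.events.append(AgentEvent(agent, duration_s, tokens))

    def total_tokens(self):
        return sum(e.tokens for e in self.events)


class ZScoreDetector:
    """Layer 2 stand-in: a z-score on total tokens, NOT an autoencoder."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.mu = 0.0
        self.sigma = 1.0

    def fit(self, benign_logs):
        totals = [log.total_tokens() for log in benign_logs]
        self.mu = mean(totals)
        # Fall back to 1.0 so identical benign runs do not divide by zero.
        self.sigma = pstdev(totals) or 1.0

    def score(self, log):
        return abs(log.total_tokens() - self.mu) / self.sigma

    def is_anomalous(self, log):
        return self.score(log) > self.threshold


def explain(log, benign_logs):
    """Layer 3 stand-in: name the agent whose token use deviates most
    from its benign average. Returns None if no agent can be compared."""
    baseline = {}
    for benign in benign_logs:
        for event in benign.events:
            baseline.setdefault(event.agent, []).append(event.tokens)
    deviations = {
        event.agent: abs(event.tokens - mean(baseline[event.agent]))
        for event in log.events
        if event.agent in baseline
    }
    return max(deviations, key=deviations.get, default=None)
```

In this sketch the detector is fitted on benign runs only and flags a new run whose score exceeds the threshold. The explanation step then runs only on flagged runs, mirroring the order in which the abstract describes the layers.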

Paper Structure

This paper contains 64 sections, 2 equations, 10 figures, 17 tables.

Figures (10)

  • Figure 1: High-level architecture of the LumiMAS observability framework incorporated in a given MAS
  • Figure 2: Overview of the anomaly detection architecture showing (1) the EPI-based autoencoder, (2) the semantic-based autoencoder, and (3) a combined detector that integrates both approaches
  • Figure 3: Root cause analysis (RCA) results obtained using CrewAI apps with GPT-4o-mini as the underlying model
  • Figure 4: Loss curve of the EPI-based anomaly detection approach on the Trip Planner application
  • Figure 5: Loss curve of the semantic-based anomaly detection approach on the Trip Planner application
  • ...and 5 more figures