Detecting Silent Failures in Multi-Agentic AI Trajectories

Divya Pathak, Harshit Kumar, Anuska Roy, Felix George, Mudit Verma, Pratibha Moogi

TL;DR

Detecting silent failures in non-deterministic Multi-Agentic AI trajectories is challenging because failures such as drift, cycles, and missing details in outputs raise no explicit error. The authors propose anomaly detection on end-to-end agentic traces and build a dataset curation pipeline that captures user behavior, agent non-determinism, and LLM variation, yielding two labeled benchmarks with 4,275 and 894 trajectories. They benchmark supervised (XGBoost) and semi-supervised (SVDD) methods, which reach accuracies of up to 98% and 96%, respectively, and they run feature-importance and error analyses to understand misclassifications. The work provides the first systematic datasets, baselines, and practical insights to guide robust anomaly detection in agentic AI systems, with implications for reducing computational cost, token usage, and erroneous behavior.

Abstract

Multi-Agentic AI systems, powered by large language models (LLMs), are inherently non-deterministic and prone to silent failures such as drift, cycles, and missing details in outputs, which are difficult to detect. We introduce the task of anomaly detection in agentic trajectories to identify these failures and present a dataset curation pipeline that captures user behavior, agent non-determinism, and LLM variation. Using this pipeline, we curate and label two benchmark datasets comprising 4,275 and 894 trajectories from Multi-Agentic AI systems. Benchmarking anomaly detection methods on these datasets, we show that supervised (XGBoost) and semi-supervised (SVDD) approaches perform comparably, achieving accuracies up to 98% and 96%, respectively. This work provides the first systematic study of anomaly detection in Multi-Agentic AI systems, offering datasets, benchmarks, and insights to guide future research.
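
To make the semi-supervised setting more concrete, the sketch below fits a detector on normal trajectories only and flags traces that fall outside the learned region. It is not the authors' pipeline and not SVDD itself: it is a simplified stand-in that places a sphere around the mean of the normal data, with the radius taken from a quantile of training distances. The trajectory format (a list of agent name and token count pairs), the three features, the agent names, and the 0.95 quantile are all assumptions made for illustration.

```python
import math
from statistics import mean


def featurize(trajectory):
    """Map a trajectory, given as (agent_name, token_count) steps, to a
    feature vector. The three features are illustrative assumptions."""
    agents = [agent for agent, _ in trajectory]
    transitions = list(zip(agents, agents[1:]))
    # Transitions seen more than once are a crude signal of a cycle.
    repeated = len(transitions) - len(set(transitions))
    total_tokens = sum(tokens for _, tokens in trajectory)
    return [float(len(trajectory)), float(total_tokens), float(repeated)]


class CentroidDetector:
    """Simplified stand-in for SVDD: a sphere around the mean of normal data."""

    def fit(self, normal_vectors, quantile=0.95):
        dims = list(zip(*normal_vectors))
        self.center = [mean(d) for d in dims]
        # Per-feature scale so that token counts do not dominate distances.
        self.scale = [max(max(d) - min(d), 1.0) for d in dims]
        dists = sorted(self._distance(v) for v in normal_vectors)
        index = min(len(dists) - 1, int(quantile * len(dists)))
        self.radius = dists[index]
        return self

    def _distance(self, vector):
        return math.sqrt(sum(((x - c) / s) ** 2
                             for x, c, s in zip(vector, self.center, self.scale)))

    def is_anomalous(self, vector):
        return self._distance(vector) > self.radius


# Hypothetical normal traces: short hand-offs between a few agents.
normal = [
    [("planner", 120), ("retriever", 300), ("writer", 250)],
    [("planner", 100), ("retriever", 280), ("writer", 300)],
    [("planner", 140), ("retriever", 320), ("writer", 270)],
    [("planner", 110), ("retriever", 310), ("critic", 90), ("writer", 260)],
]
# Hypothetical silent failure: retriever and critic loop six times.
looping = ([("planner", 120)]
           + [("retriever", 300), ("critic", 90)] * 6
           + [("writer", 250)])

detector = CentroidDetector().fit([featurize(t) for t in normal])
print(detector.is_anomalous(featurize(looping)))    # True
print(detector.is_anomalous(featurize(normal[0])))  # False
```

The looping trace completes without any explicit error, yet its step count, token total, and repeated transitions all sit far from the normal region, which is the kind of signal a trajectory-level detector can pick up. A supervised model such as XGBoost would instead be trained on labeled normal and anomalous feature vectors.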

Paper Structure

This paper contains 10 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Dataset Curation Pipeline for Multi-Agentic AI Systems
  • Figure 2: t-SNE of normal and anomalous agentic traces