MAS-FIRE: Fault Injection and Reliability Evaluation for LLM-Based Multi-Agent Systems

Jin Jia, Zhiling Deng, Zhuangbin Chen, Yingqi Wang, Zibin Zheng

TL;DR

The reliability of LLM-based multi-agent systems (MAS) is undermined by silent semantic deviations that do not trigger runtime exceptions. The authors propose MAS-FIRE, a fault-injection and reliability-evaluation framework with a taxonomy of 15 fault types and three non-invasive injection mechanisms (prompt modification, response interception and rewriting, and message routing manipulation) for diagnosing where failures originate and how systems recover. The study finds that architectural topology and foundation-model capability interact nonlinearly: iterative, closed-loop designs neutralize many faults that collapse linear pipelines, while more capable models help mitigate semantic faults when verification paths exist but can hinder recovery when strict instruction adherence traps the system. A four-tier fault-tolerance taxonomy (mechanism, rule, prompt, reasoning), combined with system-level and process-level metrics, enables fine-grained diagnosis and actionable guidance for building robust MAS. Data and code are publicly available for reproducibility.

Abstract

As LLM-based Multi-Agent Systems (MAS) are increasingly deployed for complex tasks, ensuring their reliability has become a pressing challenge. Since MAS coordinate through unstructured natural language rather than rigid protocols, they are prone to semantic failures (e.g., hallucinations, misinterpreted instructions, and reasoning drift) that propagate silently without raising runtime exceptions. Prevailing evaluation approaches, which measure only end-to-end task success, offer limited insight into how these failures arise or how effectively agents recover from them. To bridge this gap, we propose MAS-FIRE, a systematic framework for fault injection and reliability evaluation of MAS. We define a taxonomy of 15 fault types covering intra-agent cognitive errors and inter-agent coordination failures, and inject them via three non-invasive mechanisms: prompt modification, response rewriting, and message routing manipulation. Applying MAS-FIRE to three representative MAS architectures, we uncover a rich set of fault-tolerant behaviors that we organize into four tiers: mechanism, rule, prompt, and reasoning. This tiered view enables fine-grained diagnosis of where and why systems succeed or fail. Our findings reveal that stronger foundation models do not uniformly improve robustness. We further show that architectural topology plays an equally decisive role, with iterative, closed-loop designs neutralizing over 40% of faults that cause catastrophic collapse in linear workflows. MAS-FIRE provides the process-level observability and actionable guidance needed to systematically improve multi-agent systems.
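
The three non-invasive injection mechanisms named in the abstract can be pictured as thin wrappers around an agent and its message channel. The following Python sketch is illustrative only and is not the MAS-FIRE implementation: the `Agent` type, the `Message` class, the `echo_agent` stand-in, and all three `inject_*` helpers are hypothetical names introduced here, and a real agent would call an LLM rather than echo its prompt.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: an agent is modeled as a plain function from prompt to response.
Agent = Callable[[str], str]


@dataclass
class Message:
    sender: str
    receiver: str
    content: str


def inject_prompt_fault(agent: Agent, extra_instruction: str) -> Agent:
    """Prompt modification: append an extra (e.g. conflicting) instruction
    to the input before the unmodified agent sees it."""
    def faulty(prompt: str) -> str:
        return agent(prompt + "\n" + extra_instruction)
    return faulty


def inject_response_fault(agent: Agent, rewrite: Callable[[str], str]) -> Agent:
    """Response rewriting: intercept the agent's output and alter it
    before it reaches downstream agents."""
    def faulty(prompt: str) -> str:
        return rewrite(agent(prompt))
    return faulty


def inject_routing_fault(messages: List[Message], copies: int) -> List[Message]:
    """Message routing manipulation: duplicate every message `copies` times
    to mimic a message storm on the communication channel."""
    return [m for m in messages for _ in range(copies)]


def echo_agent(prompt: str) -> str:
    """Hypothetical stand-in for an LLM-backed agent."""
    return f"ANSWER({prompt})"


if __name__ == "__main__":
    conflicted = inject_prompt_fault(echo_agent, "Also do the opposite.")
    print(conflicted("Sort ascending."))

    corrupted = inject_response_fault(
        echo_agent, lambda r: r.replace("ANSWER", "WRONG")
    )
    print(corrupted("Compute the total."))

    storm = inject_routing_fault([Message("planner", "coder", "start")], copies=3)
    print(len(storm))
```

In this sketch none of the wrappers touch the agent's own code, which is the sense in which such injection is non-invasive: the fault is introduced at the input, at the output, or on the channel between agents.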

Paper Structure

This paper contains 35 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Fault Injection Model for MAS
  • Figure 2: Examples of Fault Injection Mechanisms and Multi-Agent Recovery Behaviors. (a) Instruction Logic Conflict via Prompt Modification, which introduces incompatible constraints to evaluate requirement clarification; (b) Parameter Filling Error via Interception and Response Rewriting, which alters task parameters to assess inter-agent coordination and repair; (c) Message Storm via Message Routing Manipulation, which injects redundant communication to test infrastructure-level filtering. Green panels illustrate robust recovery (good behavior), while red panels highlight representative failure modes (bad behavior).
  • Figure 3: Robustness Score ($RS_f$) of Different MAS under 15 Fault Types
  • Figure 4: Fault-tolerance Performance of Different MAS under 15 Fault Types. Gray columns indicate that the corresponding faults cannot be injected due to system architecture limitations or output format constraints.