XAgen: An Explainability Tool for Identifying and Correcting Failures in Multi-Agent Workflows

Xinru Wang, Ming Yin, Eunyee Koh, Mustafa Doga Dogan

TL;DR

XAgen is an explainability tool that supports users with varying AI expertise through three core capabilities: log visualization for glanceable workflow understanding, human-in-the-loop feedback to capture expert judgment, and automatic error detection via an LLM-as-a-judge.

Abstract

As multi-agent systems powered by Large Language Models (LLMs) are increasingly adopted in real-world workflows, users with diverse technical backgrounds are now building and refining their own agentic processes. However, these systems can fail in opaque ways, making it difficult for users to observe, understand, and correct errors. We conducted formative interviews with 12 practitioners to identify mismatches between existing debugging tools and users' needs. Based on these insights, we designed XAgen, an explainability tool that supports users with varying AI expertise through three core capabilities: log visualization for glanceable workflow understanding, human-in-the-loop feedback to capture expert judgment, and automatic error detection via an LLM-as-a-judge. In a user study with 8 participants, XAgen helped users locate failures more easily, attribute them to specific agents or steps, and iteratively improve configurations. Our findings surface human-centered design guidelines for explainable agentic AI development and highlight opportunities for more context-aware interactive debugging.

Paper Structure

This paper contains 12 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Walkthrough of the XAgen interface. Users first select a project folder (➊). We use the CrewAI framework in this prototype, as practitioners in our organization who participated in the formative interviews primarily relied on CrewAI in their daily work. The upper section of the central panel displays the workflow overview, and clicking Start Workflow (➋) executes the multi-agent workflow. The lower section of the central panel then shows the raw terminal logs (➌). During execution, each component in the flowchart is activated step by step in accordance with the log (➍). The right panel displays detailed information, including prompt configurations, tool calls, and agent rationales (➎). Task outputs are automatically evaluated by the LLM-as-a-judge (➏); the average historical evaluation score is shown as a ring, while detailed scores and rationales are listed in the right panel. The details panel also provides fields for manual feedback (➐). Users can edit prompt configurations directly in the interface (➑) and re-run the workflow to test improvements. Finally, each session can be replayed, allowing users to review historical performance and feedback (➒).
  • Figure 2: User study results comparing XAgen against a baseline (a) and evaluating helpfulness of XAgen's three core features (b).
  • Figure 3: Architecture of XAgen.
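
The scoring step in the Figure 1 caption (➏) reports an average historical evaluation score per task alongside detailed scores and rationales. The Python sketch below illustrates one way such an average could be computed from stored judge results. It is not XAgen's implementation: the `JudgeEvaluation` record, its field names, the numeric score scale, and the sample data are all hypothetical, since the caption does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class JudgeEvaluation:
    """One LLM-as-a-judge result for one task in one session (hypothetical schema)."""
    session_id: str
    task_name: str
    score: float      # assumed numeric scale; the caption does not state one
    rationale: str


def average_score_by_task(history):
    """Return the mean judge score per task across all recorded sessions."""
    scores = defaultdict(list)
    for evaluation in history:
        scores[evaluation.task_name].append(evaluation.score)
    return {task: mean(values) for task, values in scores.items()}


# Illustrative data only; not taken from the paper.
history = [
    JudgeEvaluation("session-1", "research", 6.0, "Missed a key source."),
    JudgeEvaluation("session-2", "research", 8.0, "Covered the main sources."),
    JudgeEvaluation("session-1", "summary", 7.0, "Accurate but verbose."),
]
print(average_score_by_task(history))
```

Keeping the rationale and session identifier on each record, rather than storing only a running average, would let an interface show both the aggregate value and the per-session details that the caption describes.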