IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

Elad Levi, Ilan Kadar

TL;DR

The paper addresses the challenge of evaluating conversational AI in multi-turn, tool-augmented settings where traditional benchmarks fall short in scalability and diagnostic detail. It introduces IntellAgent, a policy-graph–driven, synthetic benchmark framework that automates event generation and user–agent dialogues to produce fine-grained diagnostics. Key contributions include a scalable open-source pipeline, a graph-based representation of policy interactions, and automated, domain-agnostic evaluation that correlates strongly with established benchmarks despite using synthetic data. The framework enables targeted optimization for complex policy interactions and tool usage, supporting reproducible, cross-domain evaluation and faster deployment of robust conversational agents.

Abstract

Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at https://github.com/plurai-ai/intellagent
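The graph-based policy model described in the abstract can be illustrated with a small sketch. This is not the IntellAgent implementation: the class `PolicyGraph`, the method `sample_policies`, the policy names, and the stopping rule (walk until the summed complexity reaches a target) are all assumptions made for illustration. The sketch only shows how per-policy complexities and pairwise likelihoods could drive the selection of a multi-policy scenario.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PolicyGraph:
    """Hypothetical policy graph: nodes carry a complexity score,
    undirected edges carry a co-occurrence likelihood."""
    complexity: dict[str, int]
    edges: dict[tuple[str, str], float] = field(default_factory=dict)

    def weight(self, a: str, b: str) -> float:
        # Edges are undirected, so look up both orderings.
        return self.edges.get((a, b), self.edges.get((b, a), 0.0))

    def sample_policies(self, target: int, rng: random.Random) -> list[str]:
        """Weighted random walk over policies that stops once the summed
        complexity reaches `target` or no unvisited neighbour remains."""
        current = rng.choice(sorted(self.complexity))
        chosen = [current]
        total = self.complexity[current]
        while total < target:
            candidates = [
                p for p in sorted(self.complexity)
                if p not in chosen and self.weight(current, p) > 0
            ]
            if not candidates:
                break  # dead end: return what has been collected so far
            weights = [self.weight(current, p) for p in candidates]
            current = rng.choices(candidates, weights=weights, k=1)[0]
            chosen.append(current)
            total += self.complexity[current]
        return chosen


# Illustrative data only; these policies and numbers are not from the paper.
graph = PolicyGraph(
    complexity={"refund": 2, "identity_check": 1, "baggage": 3},
    edges={
        ("refund", "identity_check"): 0.9,
        ("refund", "baggage"): 0.4,
        ("identity_check", "baggage"): 0.2,
    },
)
picked = graph.sample_policies(target=4, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the sampled scenario reproducible, which matters when the same benchmark needs to be regenerated for different agents.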

Paper Structure (17 sections, 9 figures, 3 tables, 1 algorithm)

Figures (9)

  • Figure 1: System diagram. (1) Given a chatbot prompt and a Schema DB, the system generates an event, consisting of a user request and a system DB state, that targets a subset of policies. (2) For each event, the system simulates a conversation between the user and the chatbot. (3) A fine-grained report on the chatbot's performance is generated.
  • Figure 2: Model success rates across different challenge levels. While all models show reduced performance as the challenge level increases, they exhibit distinct patterns of decline, differing in both the onset level and the magnitude of the decrease.
  • Figure 3: Comparison of the success rates of the top four models across various policy categories, highlighting that some categories are more challenging than others. Additionally, the relative performance order of different models varies across categories.
  • Figure 4: Event generator architecture overview.
  • Figure 5: Simulator architecture overview.
  • ...and 4 more figures
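
The breakdowns in Figures 2 and 3 (success rate per challenge level and per policy category) amount to grouping simulated dialogues and averaging a success flag. The sketch below shows one way to compute them. The record type `DialogueResult`, its fields, and the function `success_rates` are hypothetical, and grouping by individual policy name stands in for the paper's policy categories.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DialogueResult:
    """Hypothetical outcome of one simulated user-chatbot conversation."""
    challenge_level: int
    policies: tuple[str, ...]
    success: bool


def success_rates(results):
    """Return (rate per challenge level, rate per policy) as two dicts."""
    by_level = defaultdict(list)
    by_policy = defaultdict(list)
    for r in results:
        by_level[r.challenge_level].append(r.success)
        # A dialogue that touches several policies counts towards each one.
        for policy in r.policies:
            by_policy[policy].append(r.success)

    def rate(flags):
        return sum(flags) / len(flags)

    return (
        {level: rate(flags) for level, flags in by_level.items()},
        {policy: rate(flags) for policy, flags in by_policy.items()},
    )


# Illustrative records only; not results from the paper.
results = [
    DialogueResult(2, ("refund",), True),
    DialogueResult(2, ("refund", "identity_check"), False),
    DialogueResult(5, ("identity_check",), False),
]
per_level, per_policy = success_rates(results)
```

Counting a multi-policy dialogue towards every policy it touches is a design choice of this sketch; it makes per-policy rates easy to read but means a single failure can lower several of them at once.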