Mind the Goal: Data-Efficient Goal-Oriented Evaluation of Conversational Agents and Chatbots using Teacher Models

Deepak Babu Piskala, Sharlene Chen, Udita Patel, Parul Kalra, Rafael Castrillo

TL;DR

The paper presents a goal-focused evaluation framework for multi-turn conversational agents. It introduces the Goal Success Rate (GSR) metric and a Root Cause of Failure (RCOF) taxonomy, and demonstrates a data-efficient, explainable approach that uses a teacher-model ensemble with Chain-of-Thought prompting in a human-in-the-loop (HITL) pipeline. The framework segments dialogs into user goals, evaluates success at the goal level using GSR, and attributes failures to actionable categories via RCOF, enabling targeted improvements. Applied to an enterprise chatbot (AIDA) with ~10k dialogs, it identifies quality patterns and tracks GSR improvements over time, guiding system evolution. The work advances end-to-end diagnostics for complex multi-agent chat systems and provides a scalable pipeline for enterprise deployments.

Abstract

Evaluating the quality of multi-turn chatbot interactions remains challenging, as most existing methods assess interactions at the turn level without addressing whether a user's overarching goal was fulfilled. A "goal" here refers to an information need or task, such as asking for policy information or applying for leave. We propose a comprehensive framework for goal-oriented evaluation of multi-agent systems (MAS), introducing the Goal Success Rate (GSR) to measure the percentage of fulfilled goals, and a Root Cause of Failure (RCOF) taxonomy to identify reasons for failure in multi-agent chatbots. Our method segments conversations by user goals and evaluates success using all relevant turns. We present a model-based evaluation system combining teacher LLMs, in which domain experts define goals and set quality standards that serve as guidance for the LLMs. The LLMs use "thinking tokens" to produce interpretable rationales, enabling explainable, data-efficient evaluations. In an enterprise setting, we apply our framework to evaluate AIDA, a zero-to-one employee conversational agent built from the ground up as a multi-agent system, and observe GSR improvement from 63% to 79% over the six months since its inception. Our framework is generic and offers actionable insights through a detailed defect taxonomy based on analysis of failure points in multi-agent chatbots, diagnosing overall success, identifying key failure modes, and informing system improvements.
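
As a rough illustration of the metric described above, the following sketch computes GSR as the percentage of fulfilled goals pooled across all dialogs and tallies failure causes for the unfulfilled ones. The `Goal` class, its field names, the function names, and the sample labels are hypothetical and chosen only for this example. The paper's exact equation is not reproduced on this page, so pooling over all goals (rather than averaging per dialog) is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Goal:
    """One user goal within a dialog (hypothetical structure for illustration)."""
    goal_id: str
    turns: List[Tuple[str, str]]   # (query, response) pairs belonging to this goal
    success: bool                  # goal-level judgment, e.g. from a teacher model
    rcof: Optional[str] = None     # root cause label, set only for failed goals


def goal_success_rate(dialogs: List[List[Goal]]) -> float:
    """Percentage of fulfilled goals, pooled over every goal in every dialog."""
    goals = [g for dialog in dialogs for g in dialog]
    if not goals:
        raise ValueError("no goals to evaluate")
    return 100.0 * sum(g.success for g in goals) / len(goals)


def rcof_distribution(dialogs: List[List[Goal]]) -> Dict[str, int]:
    """Count root-cause labels across all failed goals."""
    return dict(Counter(
        g.rcof for dialog in dialogs for g in dialog
        if not g.success and g.rcof is not None
    ))


# Made-up example: one dialog with three goals, two of which failed.
dialogs = [[
    Goal("G1", [("q1", "r1"), ("q2", "r2")], False, "reasoning error"),
    Goal("G2", [("q3", "r3")], False, "language understanding error"),
    Goal("G3", [("q4", "r4")], True),
]]

print(f"GSR = {goal_success_rate(dialogs):.1f}%")  # GSR = 33.3%
print(rcof_distribution(dialogs))
```

Note that a turn-level metric computed on the same data could look quite different, since a goal counts as fulfilled only once regardless of how many turns it spans.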

Paper Structure

This paper contains 16 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Signals used for evaluating chatbot quality
  • Figure 2: Goal-oriented breakdown of a multi-turn employee chatbot conversation. The full dialog (left) is segmented into three distinct goals: $G_1$ (policy inquiry), $G_2$ (project clarification), and $G_3$ (translation request). Each goal is independently evaluated for success or failure using goal-level metrics. Turn-level evaluation may suggest high success, but goal-level evaluation reveals that $G_1$ and $G_2$ failed due to reasoning and language understanding errors, respectively. Root Cause of Failure (RCOF) is annotated using structured rationale snippets (right), highlighting the earliest defective turn per goal.
  • Figure 3: Anatomy of a chatbot session: Each dialog consists of multiple goals ($G_i$), where each goal comprises one or more turns ($T_j$), formed by a query-response pair ($q_j$, $r_j$). Goal $G_3$ is marked failed due to information unavailability.
  • Figure 4: (a) Distribution of the top five chatbot failure root causes, annotated with both absolute counts and percentages. (b) Trend of the overall Goal-Success Rate (GSR) from Oct ’24 through May ’25, showing a steady improvement.
  • Figure 5: HITL evaluation pipeline: AIDA conversations are processed by multiple expert models using Chain-of-Thought prompts. Majority-voted outputs are accepted as labels. Disagreements are escalated to human experts guided by SOPs.
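
The routing step described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and label strings are hypothetical, and the rule that a label is accepted only on a strict majority (with everything else escalated to human review) is an assumption, since the caption does not specify how ties or partial disagreements are handled.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label backed by a strict majority of teacher models, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def route(votes: Sequence[str]) -> Tuple[str, Optional[str]]:
    """Accept the majority label automatically, or escalate to a human expert.

    Assumption: any case without a strict majority counts as a disagreement.
    """
    label = majority_label(votes)
    if label is None:
        return ("human_review", None)
    return ("auto", label)


# Three hypothetical teacher-model judgments for one goal.
print(route(["success", "success", "failure"]))  # ('auto', 'success')
# An even split has no strict majority, so it goes to a human expert.
print(route(["success", "failure"]))             # ('human_review', None)
```

Under this rule, human effort is spent only on goals where the teacher models disagree, which is one way such a pipeline could remain data-efficient.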