Evaluation and Continual Improvement for an Enterprise AI Assistant

Akash V. Maharaj, Kun Qian, Uttaran Bhattacharya, Sally Fang, Horia Galatanu, Manas Garg, Rachel Hanessian, Nishant Kapoor, Ken Russell, Shivakumar Vaithyanathan, Yunyao Li

TL;DR

The paper tackles the challenge of evaluating and continually improving an enterprise AI assistant during active development, where a complex, multi-component pipeline must evolve with changing data and user needs. It introduces a production-focused continual improvement framework centered on a severity-based error taxonomy, a human-in-the-loop annotation and error-analysis workflow, and an end-to-end evaluation environment that prioritizes actionable insights over traditional automated metrics. Preliminary results show concrete gains: an Out-of-Scope classifier with 90% precision reduces Sev-0 errors, and UI overrides convert Sev-1 errors into Sev-2 recoveries. Together, these contributions offer a scalable, human-centered approach to improving enterprise assistants, with direct implications for user trust, engagement, and productivity.

Abstract

The development of conversational AI assistants is an iterative process with multiple components. As such, the evaluation and continual improvement of these assistants is a complex and multifaceted problem. This paper introduces the challenges in evaluating and improving a generative AI assistant for enterprises, which is under active development, and how we address these challenges. We also share preliminary results and discuss lessons learned.

Paper Structure (12 sections, 3 figures, 4 tables)

This paper contains 12 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Assistant Overall Architecture
  • Figure 2: Evaluation and continual improvement framework of Assistant
  • Figure 3: Dashboard showing snapshot of Error Severities and time-evolution for a single component. Illustrative data of similar magnitude to production numbers.