Evaluation and Continual Improvement for an Enterprise AI Assistant

Akash V. Maharaj; Kun Qian; Uttaran Bhattacharya; Sally Fang; Horia Galatanu; Manas Garg; Rachel Hanessian; Nishant Kapoor; Ken Russell; Shivakumar Vaithyanathan; Yunyao Li

TL;DR

The paper tackles the challenge of evaluating and continually improving an enterprise AI assistant during active development, where a complex, multi-component pipeline must evolve with changing data and user needs. It introduces a production-focused continual improvement framework built on three elements: a severity-based error taxonomy, a human-in-the-loop annotation and error-analysis workflow, and an end-to-end evaluation environment that prioritizes actionable insights over traditional automated metrics. Preliminary results show concrete gains: an Out-of-Scope classifier with 90% precision reduces Sev-0 errors, and UI overrides convert Sev-1 errors into Sev-2 recoveries. Together, these contributions offer a scalable, human-centered approach to improving enterprise assistants, with direct implications for user trust, engagement, and productivity.

Abstract

The development of conversational AI assistants is an iterative process involving multiple components. As such, the evaluation and continual improvement of these assistants is a complex and multifaceted problem. This paper describes the challenges of evaluating and improving a generative AI assistant for enterprises that is under active development, and how we address them. We also share preliminary results and discuss lessons learned.

Paper Structure (12 sections, 3 figures, 4 tables)

Introduction
Limitations of Existing Approaches
Limitations of Explicit Feedback
Limitations of Implicit Feedback
Limitations of Off-the-Shelf Benchmarks
Our Approach
Design Decisions
Severity-based Error Taxonomy
Framework for Evaluation and Continual Improvement
Preliminary Results: Examples
Discussion
Future Work

Figures (3)

Figure 1: Assistant Overall Architecture
Figure 2: Evaluation and continual improvement framework of Assistant
Figure 3: Dashboard showing snapshot of Error Severities and time-evolution for a single component. Illustrative data of similar magnitude to production numbers.
