Evaluation and Continual Improvement for an Enterprise AI Assistant
Akash V. Maharaj, Kun Qian, Uttaran Bhattacharya, Sally Fang, Horia Galatanu, Manas Garg, Rachel Hanessian, Nishant Kapoor, Ken Russell, Shivakumar Vaithyanathan, Yunyao Li
TL;DR
The paper tackles the challenge of evaluating and continually improving an enterprise AI assistant that is under active development, where a complex, multi-component pipeline must evolve with changing data and user needs. It introduces a production-focused continual improvement framework built on three elements: a severity-based error taxonomy, a human-in-the-loop annotation and error-analysis workflow, and an end-to-end evaluation environment that prioritizes actionable insights over traditional automated metrics. Preliminary results show concrete gains: an Out-of-Scope classifier with 90% precision reduces Sev-0 errors, and UI overrides convert Sev-1 errors into Sev-2 recoveries. Together, these contributions offer a scalable, human-centered approach to improving enterprise assistants, with direct implications for user trust, engagement, and productivity.
Abstract
The development of conversational AI assistants is an iterative process involving multiple components. As such, evaluating and continually improving these assistants is a complex and multifaceted problem. This paper describes the challenges of evaluating and improving a generative AI assistant for enterprises that is under active development, and explains how we address them. We also share preliminary results and discuss lessons learned.
