Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions
Emi Soroka, Tanmay Chopra, Krish Desai, Sanjay Lall
TL;DR
This paper addresses the challenge of evaluating multi-turn objective-driven interactions without ground-truth labels by introducing three unsupervised metrics: goal labeling, completion labeling, and LLM uncertainty. It combines LLM-guided clustering to discover user intents, a distribution-based completeness test that uses a fine-tuned model to approximate a shifted data distribution, and a response-tree framework that quantifies uncertainty beyond single-token likelihoods. Across open-domain and task-specific datasets, the authors show that small fine-tuned models can match or exceed larger LLM judges in completion labeling, and that the metrics give insight into uncertainty and distributional shift. The framework enables scalable online monitoring, and potentially intervention, for enterprise AI agents while reducing reliance on costly human annotation and large LLM judges. Overall, the work lays a practical foundation for judge-free evaluation of enterprise-scale LLM interactions, with implications for the deployment and governance of AI agents.
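To make the idea of uncertainty "beyond single-token likelihoods" concrete, the sketch below computes the entropy of the distribution over complete responses in a small tree of candidate continuations. It is an illustrative assumption, not the authors' method: the `Node` class, the `leaf_probs` and `tree_entropy` helpers, and the choice of Shannon entropy over leaves are all hypothetical, and the summary above does not specify how the paper's response tree is built or scored.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical response-tree node: a text segment and its
    conditional probability given the parent segment."""
    text: str
    prob: float
    children: List["Node"] = field(default_factory=list)


def leaf_probs(node: Node, prefix: float = 1.0) -> List[float]:
    """Return the joint probability of every root-to-leaf path."""
    p = prefix * node.prob
    if not node.children:
        return [p]
    out: List[float] = []
    for child in node.children:
        out.extend(leaf_probs(child, p))
    return out


def tree_entropy(root: Node) -> float:
    """Shannon entropy (in nats) over complete responses.

    Leaf probabilities are renormalized because a sampled tree
    usually covers only part of the model's probability mass.
    """
    probs = leaf_probs(root)
    total = sum(probs)
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


# Two equally likely complete responses give entropy ln(2).
root = Node("", 1.0, [Node("Sure, I can help.", 0.5),
                      Node("Could you clarify?", 0.5)])
print(tree_entropy(root))
```

A single confident response path yields entropy near zero, while a tree whose probability mass is spread over many distinct responses yields a larger value, which is the kind of sequence-level signal that a single next-token probability cannot capture.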
Abstract
Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.
