Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions

Emi Soroka, Tanmay Chopra, Krish Desai, Sanjay Lall

TL;DR

This paper addresses the challenge of evaluating multi-turn objective-driven interactions without ground-truth labels by introducing three unsupervised metrics: goal labeling, completion labeling, and LLM uncertainty. It combines an LLM-guided clustering approach to discover intents, a distribution-based completeness test using a fine-tuned model to approximate a shifted data distribution, and a response-tree framework to quantify uncertainty beyond single-token likelihoods. Across diverse open-domain and task-specific datasets, the authors demonstrate that small, fine-tuned models can match or exceed larger LLM judges in completion labeling and provide robust insights into uncertainty and distributional shifts. The proposed framework enables scalable, online monitoring and potential intervention strategies for enterprise AI agents, reducing reliance on costly human annotations and large judges. Overall, the work lays a practical foundation for judge-free evaluation of complex, enterprise-scale LLM interactions with implications for deployment and governance of AI agents.

Abstract

Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously-undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.

Paper Structure

This paper contains 33 sections, 4 equations, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: Identifying incomplete interactions via LLM completion of the full interaction.
  • Figure 2: Simplified response tree for the prompt "How many 'r's are in the word strawberry?". Lighter branches correspond to less probable responses.
  • Figure 3: Top 10 largest clusters for LMSYS.
  • Figure 4: Labeling confusion matrices for two runs of LLM-supervised clustering (top) and an LLM-only labeling baseline (bottom). To visualize changes across two clustering runs, we compute a matrix $D$ where $D_{ij}$ is the number of elements assigned to cluster $i$ in run 1 and to cluster $j$ in run 2, then reorder the rows and columns so that the largest entries lie on the diagonal.
  • Figure 5: Top 10 largest clusters for objective-driven datasets. We report full labels in the appendix.
  • ...and 8 more figures
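
The matrix $D$ described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `overlap_matrix` and `align_diagonal` are hypothetical, and the greedy reordering is an assumption, since the caption does not say how the matrix is sorted. An optimal assignment (for example, the Hungarian algorithm) would be a reasonable alternative.

```python
import numpy as np


def overlap_matrix(labels_run1, labels_run2):
    """Return D where D[i, j] counts items in cluster i (run 1) and cluster j (run 2)."""
    a = np.asarray(labels_run1).ravel()
    b = np.asarray(labels_run2).ravel()
    if a.shape != b.shape:
        raise ValueError("both runs must label the same items")
    # Map arbitrary cluster ids (ints or strings) to contiguous indices.
    ids1, inv1 = np.unique(a, return_inverse=True)
    ids2, inv2 = np.unique(b, return_inverse=True)
    D = np.zeros((len(ids1), len(ids2)), dtype=int)
    np.add.at(D, (inv1, inv2), 1)  # accumulate one count per item
    return D, ids1, ids2


def align_diagonal(D):
    """Greedily reorder rows and columns so large overlaps land on the diagonal.

    Assumption: the sort rule is a greedy match on the largest remaining entry.
    """
    D = np.asarray(D)
    work = D.astype(float)  # astype returns a copy, so D is not modified
    rows, cols = [], []
    for _ in range(min(D.shape)):
        i, j = np.unravel_index(np.argmax(work), work.shape)
        rows.append(int(i))
        cols.append(int(j))
        work[i, :] = -1.0  # exclude the matched row and column
        work[:, j] = -1.0
    # Unmatched rows or columns (when D is not square) go at the end.
    rows += [i for i in range(D.shape[0]) if i not in rows]
    cols += [j for j in range(D.shape[1]) if j not in cols]
    return D[np.ix_(rows, cols)], rows, cols


# Toy example: run 2 finds the same three clusters but numbers them differently.
run1 = [0, 0, 1, 1, 2, 2]
run2 = [1, 1, 2, 2, 0, 0]
D, ids1, ids2 = overlap_matrix(run1, run2)
aligned, row_order, col_order = align_diagonal(D)
print(aligned)
```

When two runs agree up to a relabeling of clusters, the aligned matrix is diagonal. Off-diagonal mass after alignment indicates items that moved between clusters from one run to the other.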