Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis

Chihiro Watanabe, Jingyu Sun

TL;DR

This paper proposes an evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers, enabling a more nuanced assessment of response quality than exact-match metrics alone.

Abstract

Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via majority voting, and compares it against reference answers. However, this process can obscure the quality and distributional characteristics of the original responses. In this paper, we propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers. This enables a more nuanced assessment of response quality beyond exact match metrics. To analyze the response distributions across different agent configurations, we further introduce a clustering method for ECDFs using their distances and the $k$-medoids algorithm. Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions. The clustering analysis also reveals interpretable group structures in the responses, offering insights into the impact of temperature, persona, and question topics.
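The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not specify the embedding model, the ECDF distance, or the k-medoids variant, so the sketch assumes that responses and references are already embedded as vectors, that the distance between two ECDFs is the area between the curves (an L1 distance on a fixed grid), and that a simple alternating assign-and-update k-medoids is acceptable. All function names are hypothetical.

```python
import numpy as np

def cosine_similarities(responses, reference):
    """Cosine similarity between each response embedding (rows) and one reference embedding."""
    responses = np.asarray(responses, dtype=float)
    reference = np.asarray(reference, dtype=float)
    num = responses @ reference
    den = np.linalg.norm(responses, axis=1) * np.linalg.norm(reference)
    return num / den

def ecdf_on_grid(samples, grid):
    """ECDF F(t) = fraction of samples <= t, evaluated at every grid point."""
    samples = np.sort(np.asarray(samples, dtype=float))
    return np.searchsorted(samples, grid, side="right") / samples.size

def pairwise_l1(ecdfs, grid):
    """Assumed distance: area between two ECDF curves, approximated on a uniform grid."""
    step = grid[1] - grid[0]
    diff = np.abs(ecdfs[:, None, :] - ecdfs[None, :, :])
    return diff.sum(axis=2) * step

def k_medoids(dist, k, init=None, n_iter=100, seed=0):
    """Alternating k-medoids on a precomputed distance matrix (a simple variant, not PAM)."""
    n = dist.shape[0]
    if init is None:
        rng = np.random.default_rng(seed)
        medoids = rng.choice(n, size=k, replace=False)
    else:
        medoids = np.asarray(init, dtype=int)
    for _ in range(n_iter):
        # Assignment step: each ECDF joins the cluster of its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster ends up empty
            # Update step: the medoid is the member with the smallest
            # total distance to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels
```

In use, each agent configuration (or question) would contribute one ECDF built from the cosine similarities of its responses, evaluated on a shared grid such as `np.linspace(-1.0, 1.0, 201)`. The stacked ECDFs go to `pairwise_l1`, and the resulting matrix goes to `k_medoids`. Because a medoid is always one of the observed ECDFs, each cluster can be inspected through a real representative configuration.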

Paper Structure (9 sections, 7 equations, 5 figures, 1 algorithm)


Figures (5)

  • Figure 1: The proposed framework of ECDF clustering for LLM-based agent system analysis.
  • Figure 2: ECDFs for each accuracy value in setting P. The blue line shows the centroid.
  • Figure 4: ECDFs of each cluster in setting P. Black and blue lines show the medoid and centroid of the cluster, respectively.
  • Figure 6: ECDFs of each cluster in setting T.
  • Figure 8: Example answers of medoids of Clusters $0$, $7$, and $15$ in setting P. Regarding the candidate answers, only unique answers are listed.