Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge

Ravi Raju, Swayambhoo Jain, Bo Li, Jonathan Li, Urmish Thakker

TL;DR

The work addresses the failure of existing benchmarks to capture the domain-specific and multilingual behavior of LLMs in chat-like settings. It introduces a refreshable data pipeline that converts unlabeled data into labeled, domain-diverse clusters using embeddings, seed labeling, and a $k$-NN classifier, yielding 1573 samples across 14 categories. Evaluating ten models with a Bradley-Terry-based analysis and comparing the results against Chatbot Arena, the approach achieves high separability (84%) and strong agreement (84%) with human-aligned rankings (0.915 Spearman correlation, 0.0417 Brier score), outperforming prior benchmarks. An open-source evaluation tool enables fine-grained, category-level diagnosis of model performance, promoting transparency and practical, domain-focused benchmarking for practitioners.
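
As a rough illustration of what a Bradley-Terry-based analysis involves, the sketch below fits Bradley-Terry strengths to a matrix of pairwise win counts using the standard minorization-maximization (MM) update. The fitting procedure actually used in the paper is not stated in this summary, so the function name `fit_bradley_terry`, the iteration count, and the toy win counts are all assumptions made for illustration.

```python
def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the MM update (illustrative sketch).

    wins[i][j] is the number of times model i was preferred over model j.
    Returns strengths p that sum to 1; P(i beats j) = p[i] / (p[i] + p[j]).
    """
    n = len(wins)
    p = [1.0] * n  # start from equal strengths
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent contributes (games played) / (sum of current strengths).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        scale = sum(new_p)
        p = [x / scale for x in new_p]  # normalize so strengths sum to 1
    return p


# Hypothetical judge verdicts for three models, 10 comparisons per pair.
toy_wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = fit_bradley_terry(toy_wins)
print(strengths)
```

The sketch assumes every pair of models has been compared at least once and every model has at least one win; otherwise the update divides by zero or collapses a strength to zero.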

Abstract

Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications. A benchmark's usefulness is determined by its ability to clearly differentiate between models of varying capabilities (separability) and to align closely with human preferences. Existing frameworks like Alpaca-Eval 2.0 LC \cite{dubois2024lengthcontrolledalpacaevalsimpleway} and Arena-Hard v0.1 \cite{li2024crowdsourced} are limited by their focus on general-purpose queries and their lack of diversity across domains such as law, medicine, and multilingual contexts. In this paper, we address these limitations by introducing a novel data pipeline that curates diverse, domain-specific evaluation sets tailored for LLM-as-a-Judge frameworks. Our approach combines manual curation, semi-supervised learning to generate clusters, and stratified sampling to ensure balanced representation across a wide range of domains and languages. The resulting evaluation set, which includes 1573 samples across 14 categories, demonstrates high separability (84%) across ten top-ranked models, 84% agreement with Chatbot Arena, and a Spearman correlation of 0.915. The agreement value is 9% better than Arena Hard and 20% better than AlpacaEval 2.0 LC, while the Spearman coefficient is 0.7 more than the next best benchmark, showcasing a significant improvement in the usefulness of the benchmark. We further provide an open-source evaluation tool that enables fine-grained analysis of model performance across user-defined categories, offering valuable insights for practitioners. This work contributes to the ongoing effort to enhance the transparency, diversity, and effectiveness of LLM evaluation methodologies.
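
To make the two headline metrics concrete, here is a minimal sketch. It assumes separability is computed as the fraction of model pairs whose confidence intervals do not overlap, and it uses the textbook Spearman formula for rankings without ties. The function names, model names, and numbers are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def separability(intervals):
    """Fraction of model pairs whose confidence intervals do not overlap.

    intervals maps a model name to a (low, high) confidence interval.
    """
    pairs = list(combinations(intervals.values(), 2))
    separated = sum(
        1
        for (lo_a, hi_a), (lo_b, hi_b) in pairs
        if hi_a < lo_b or hi_b < lo_a
    )
    return separated / len(pairs)


def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score dicts (assumes no ties)."""

    def ranks(scores):
        order = sorted(scores, key=scores.get, reverse=True)
        return {model: rank for rank, model in enumerate(order, start=1)}

    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    n = len(ranks_a)
    d_squared = sum((ranks_a[m] - ranks_b[m]) ** 2 for m in ranks_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical win-rate intervals and scores for three models.
toy_intervals = {"a": (0.70, 0.80), "b": (0.55, 0.65), "c": (0.60, 0.72)}
benchmark_scores = {"a": 3, "b": 2, "c": 1}
reference_scores = {"a": 3, "b": 1, "c": 2}

print(separability(toy_intervals))
print(spearman(benchmark_scores, reference_scores))
```

In the toy data only the pair (a, b) has disjoint intervals, so one of three pairs is separated; a benchmark with tighter intervals would score higher.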

Paper Structure

This paper contains 21 sections, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Compared to other benchmark frameworks, our approach introduces a data pipeline that curates unlabeled data into categories covering the domains and capabilities that the practitioner cares about. It can be refreshed with new data and is more diverse than the alternatives.
  • Figure 2: Alpaca-Eval category breakdown
  • Figure 3: Arena-Hard v0.1 category breakdown
  • Figure 4: Visual comparison of win-rate separability across 10 models for our method, Arena-Hard v0.1, and Alpaca-Eval 2.0 LC. Our method has fewer overlapping confidence intervals than the other baselines.
  • Figure 5: Data pipeline: After aggregating the prompts from datasets, we generate embeddings using a text embedding model. We set aside a seed set of prompts, label each with one of the categories we care about, and use their embeddings to train a k-NN classifier. Subsequently, we classify the unlabeled data with the trained k-NN to create clusters of categories. We balance the clusters with stratified sampling and then manually curate the remaining prompts by removing overly long prompts (greater than 5000 words) and checking for low-quality content (nonsense prompts, NSFW content, etc.) to obtain the final evaluation set.
  • ...and 3 more figures
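
The pipeline described in the Figure 5 caption can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the two-dimensional vectors stand in for the output of a real text embedding model, the category names are placeholders, and the helpers `knn_label`, `stratified_sample`, and `drop_long_prompts` are hypothetical. The manual check for low-quality content is a human step and is not modeled.

```python
import math
import random
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_label(query, seed_set, k=3):
    """Label a query embedding by majority vote over its k nearest seeds."""
    nearest = sorted(
        seed_set, key=lambda item: cosine(query, item[0]), reverse=True
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def stratified_sample(labeled, per_category, seed=0):
    """Draw up to per_category prompts from each labeled cluster."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for prompt, label in labeled:
        clusters[label].append(prompt)
    return {
        label: rng.sample(prompts, min(per_category, len(prompts)))
        for label, prompts in clusters.items()
    }


def drop_long_prompts(prompts, max_words=5000):
    """Remove prompts longer than max_words (whitespace-delimited words)."""
    return [p for p in prompts if len(p.split()) <= max_words]


# Hypothetical seed set: (embedding, category) pairs labeled in advance.
seed_set = [
    ([1.0, 0.1], "law"),
    ([0.9, 0.2], "law"),
    ([0.1, 1.0], "medicine"),
    ([0.2, 0.9], "medicine"),
]

# Hypothetical unlabeled prompts with stand-in embeddings.
unlabeled = {
    "Is this contract clause enforceable?": [0.95, 0.15],
    "What counts as breach of contract?": [0.85, 0.25],
    "What is a safe dosage of ibuprofen?": [0.15, 0.95],
}

labeled = [(prompt, knn_label(vec, seed_set)) for prompt, vec in unlabeled.items()]
balanced = stratified_sample(labeled, per_category=1)
final_set = {label: drop_long_prompts(ps) for label, ps in balanced.items()}
print(final_set)
```

Capping each cluster at the same size is one simple way to balance categories; the paper's actual sampling scheme may differ.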