Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge

Ravi Raju; Swayambhoo Jain; Bo Li; Jonathan Li; Urmish Thakker

TL;DR

The work addresses the failure of existing benchmarks to capture domain-specific and multilingual behavior of LLMs in chat-like settings. It introduces a refreshable data pipeline that converts unlabeled data into labeled, domain-diverse clusters using embeddings, seed labeling, and a $k$-NN classifier, yielding 1573 samples across 14 categories. Ten models are evaluated with a Bradley-Terry-based analysis and compared against Chatbot Arena: the benchmark reaches 84% separability, 84% agreement with the human-aligned Chatbot Arena rankings, a Spearman correlation of 0.915, and a Brier score of 0.0417, outperforming prior benchmarks. An open-source evaluation tool enables fine-grained, category-level diagnosis of model performance, supporting transparent, domain-focused benchmarking for practitioners.
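
The summary mentions a Bradley-Terry-based analysis without giving the fitting procedure. As a rough illustration only, the sketch below fits Bradley-Terry strengths from pairwise win counts using the standard minorization-maximization update. The function name `bradley_terry`, the iteration count, and the toy win counts are assumptions made for this example and are not taken from the paper.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a dict {(winner, loser): count}.

    Uses the classic MM update p_i <- W_i / sum_j n_ij / (p_i + p_j),
    then rescales the strengths so they sum to 1.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(c for (w, _), c in wins.items() if w == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        scale = sum(updated.values())
        strength = {m: s / scale for m, s in updated.items()}
    return strength


# Hypothetical judge verdicts: (winner, loser) -> number of wins.
toy_wins = {
    ("a", "b"): 8, ("b", "a"): 2,
    ("b", "c"): 7, ("c", "b"): 3,
    ("a", "c"): 9, ("c", "a"): 1,
}
strengths = bradley_terry(toy_wins)
# Under the model, P(a beats b) = strength[a] / (strength[a] + strength[b]).
prob_a_beats_b = strengths["a"] / (strengths["a"] + strengths["b"])
```

Under this model the fitted strengths give both a ranking of the models and a predicted win probability for any pair, which is what makes pairwise judge verdicts comparable across many models.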

Abstract

Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications. A benchmark's usefulness is determined by its ability to clearly differentiate between models of varying capabilities (separability) and to align closely with human preferences. Existing frameworks like Alpaca-Eval 2.0 LC \cite{dubois2024lengthcontrolledalpacaevalsimpleway} and Arena-Hard v0.1 \cite{li2024crowdsourced} are limited by their focus on general-purpose queries and their lack of diversity across domains such as law, medicine, and multilingual contexts. In this paper, we address these limitations by introducing a novel data pipeline that curates diverse, domain-specific evaluation sets tailored for LLM-as-a-Judge frameworks. Our approach combines manual curation, semi-supervised learning to generate clusters, and stratified sampling to ensure balanced representation across a wide range of domains and languages. The resulting evaluation set, which includes 1573 samples across 14 categories, demonstrates high separability (84%) across ten top-ranked models, along with 84% agreement and a 0.915 Spearman correlation with Chatbot Arena. The agreement values are 9% better than Arena Hard and 20% better than AlpacaEval 2.0 LC, while the Spearman coefficient is 0.7 more than the next best benchmark, showcasing a significant improvement in the usefulness of the benchmark. We further provide an open-source evaluation tool that enables fine-grained analysis of model performance across user-defined categories, offering valuable insights for practitioners. This work contributes to the ongoing effort to enhance the transparency, diversity, and effectiveness of LLM evaluation methodologies.
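
The abstract describes separability only informally, as the ability to tell models of different capability apart. A minimal sketch of one common way to compute it follows, assuming separability is the fraction of model pairs whose win-rate confidence intervals do not overlap. That definition, the function name `separability`, and the toy intervals are assumptions for illustration and are not results from the paper.

```python
from itertools import combinations


def separability(intervals):
    """Fraction of model pairs whose confidence intervals do not overlap.

    `intervals` maps a model name to a (lower, upper) win-rate interval.
    """
    pairs = list(combinations(intervals.values(), 2))
    if not pairs:
        raise ValueError("need at least two models")
    separated = sum(
        1
        for (lo_a, hi_a), (lo_b, hi_b) in pairs
        if hi_a < lo_b or hi_b < lo_a
    )
    return separated / len(pairs)


# Hypothetical 95% intervals, invented for this example.
toy_intervals = {
    "model_a": (0.70, 0.78),
    "model_b": (0.60, 0.68),
    "model_c": (0.65, 0.72),
    "model_d": (0.30, 0.40),
}
score = separability(toy_intervals)  # 4 of the 6 pairs are separated
```

Here `model_c` overlaps both `model_a` and `model_b`, so only four of the six pairs count as separated. A benchmark with tighter intervals or a wider spread of win rates would score higher.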

Paper Structure (21 sections, 8 figures, 4 tables)

Introduction
Related Work
Methodology
Data Sources
Data pipeline
Experimental Setup
Data pipeline details
LLM-as-a-Judge Details
Obtaining Confidence Intervals
Metrics
Results
Separability, Agreement with CI (95%), Pair Brier Score
Diversity
Category Separability
Using different judges
...and 6 more sections

Figures (8)

Figure 1: Compared to other benchmark frameworks, our approach introduces a data pipeline that sorts unlabeled data into categories covering the domains and capabilities the practitioner cares about. It can be refreshed with new data and is more diverse than the alternatives.
Figure 2: Alpaca-Eval category breakdown
Figure 3: Arena-Hard v0.1 category breakdown
Figure 4: Visual comparison of our method, Arena-Hard v0.1, and Alpaca-Eval 2.0 LC on the separability of win rates for 10 models. Our method has fewer overlapping confidence intervals than the other baselines.
Figure 5: Data pipeline: After aggregating the prompts from datasets, we generate embeddings using a text embedding model. We set aside a subset of prompts as a seed set, label each with one of the categories we care about, and use their embeddings to train the k-NN. We then classify the unlabeled data with the trained k-NN to create category clusters. We balance the clusters with stratified sampling and manually curate the remaining prompts by removing overly long prompts (greater than 5000 words) and checking for low-quality content (nonsense prompts, NSFW content, etc.) to obtain the final evaluation set.
...and 3 more figures
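
The pipeline in the Figure 5 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions: the 2-D vectors stand in for real text embeddings, cosine similarity with a majority vote stands in for the k-NN classifier, and the helper names (`knn_label`, `stratified_sample`, `within_length`), the value of `k`, and the per-category quota are hypothetical. Only the 5000-word cutoff comes from the caption, and the manual quality check is not modeled.

```python
import math
import random
from collections import Counter, defaultdict


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_label(query, seed, k=3):
    """Majority label among the k seed embeddings most similar to `query`."""
    nearest = sorted(seed, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def stratified_sample(labeled, per_category, rng):
    """Draw up to `per_category` prompts from each category."""
    buckets = defaultdict(list)
    for prompt, label in labeled:
        buckets[label].append(prompt)
    return {
        label: rng.sample(prompts, min(per_category, len(prompts)))
        for label, prompts in buckets.items()
    }


def within_length(prompt, max_words=5000):
    """Length filter matching the 5000-word cutoff in the caption."""
    return len(prompt.split()) <= max_words


# Hand-labeled seed set: (toy embedding, category).
seed = [
    ((1.0, 0.1), "law"), ((0.9, 0.2), "law"), ((0.95, 0.0), "law"),
    ((0.1, 1.0), "medicine"), ((0.2, 0.9), "medicine"), ((0.0, 0.95), "medicine"),
]
# Unlabeled prompts with toy embeddings.
unlabeled = {
    "contract question": (0.8, 0.1),
    "tort question": (0.9, 0.3),
    "statute question": (1.0, 0.2),
    "dosage question": (0.1, 0.8),
}
labeled = [(prompt, knn_label(vec, seed)) for prompt, vec in unlabeled.items()]
sampled = stratified_sample(labeled, per_category=2, rng=random.Random(0))
eval_set = {
    label: [p for p in prompts if within_length(p)]
    for label, prompts in sampled.items()
}
```

Capping each category at the same quota is what keeps a large cluster (here, law) from dominating the final set. A real pipeline would replace the toy vectors with calls to an embedding model and would likely use a library k-NN implementation.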
