Generate, Evaluate, Iterate: Synthetic Data for Human-in-the-Loop Refinement of LLM Judges

Hyo Jin Do, Zahra Ashktorab, Jasmina Gajcin, Erik Miehling, Martín Santillán Cooper, Qian Pan, Elizabeth M. Daly, Werner Geyer

TL;DR

This work addresses the paucity of diverse data for refining LLM-based evaluation criteria by introducing EvalAssist, a tool that integrates synthetic data generation into the LLM-as-a-judge workflow. It enables configurable domains, personas, lengths, and borderline cases, along with inline AI-assisted editing and transparent prompts. In a mixed-method study with 24 practitioners, the synthetic data tool was preferred for speed and diversity, and its data supported criteria refinement as effectively as hand-crafted data. The results suggest that synthetic data can scale evaluation workflows without increasing workload and hold promise for improving human-LLM alignment in settings requiring rapid, diverse test cases.

Abstract

The LLM-as-a-judge paradigm enables flexible, user-defined evaluation, but its effectiveness is often limited by the scarcity of diverse, representative data for refining criteria. We present a tool that integrates synthetic data generation into the LLM-as-a-judge workflow, empowering users to create tailored and challenging test cases with configurable domains, personas, lengths, and desired outcomes, including borderline cases. The tool also supports AI-assisted inline editing of existing test cases. To enhance transparency and interpretability, it reveals the prompts and explanations behind each generation. In a user study (N=24), 83% of participants preferred the tool over manually creating or selecting test cases, as it allowed them to rapidly generate diverse synthetic data without additional workload. The generated synthetic data proved as effective as hand-crafted data for both refining evaluation criteria and aligning with human preferences. These findings highlight synthetic data as a promising alternative, particularly in contexts where efficiency and scalability are critical.
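
To make the configurable generation described above concrete, here is a minimal sketch of how the named options (domain, persona, length, desired outcome, borderline) could be collected and turned into a generation prompt. This is an illustration under assumptions, not EvalAssist's actual API or prompt wording: `GenerationConfig`, `build_generation_prompt`, and every field name and prompt phrase are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Hypothetical settings mirroring the options named in the abstract."""
    criterion: str          # the evaluation criterion the judge applies
    domain: str             # e.g. "healthcare"
    persona: str            # who is assumed to have written the text
    length: str             # e.g. "short", "medium", "long"
    target_outcome: str     # label the judge is expected to assign
    borderline: bool = False  # ask for a hard-to-judge case


def build_generation_prompt(cfg: GenerationConfig) -> str:
    """Assemble a plain-text prompt for a text-generation model.

    The wording is invented for illustration; the paper's real prompts
    are shown in the tool itself and are not reproduced here.
    """
    lines = [
        f"Write one test case for the evaluation criterion: {cfg.criterion}",
        f"Domain: {cfg.domain}",
        f"Persona of the writer: {cfg.persona}",
        f"Length: {cfg.length}",
        f"Expected judge label: {cfg.target_outcome}",
    ]
    if cfg.borderline:
        lines.append("Make this a borderline case that is hard to judge either way.")
    # Requesting a rationale mirrors the abstract's point that each
    # generation comes with an explanation.
    lines.append("Briefly explain why the text fits the expected label.")
    return "\n".join(lines)


if __name__ == "__main__":
    cfg = GenerationConfig(
        criterion="The response is free of gender bias",
        domain="healthcare",
        persona="customer support agent",
        length="short",
        target_outcome="No",
        borderline=True,
    )
    print(build_generation_prompt(cfg))
```

Keeping the configuration in one immutable object makes it easy to show the exact prompt behind each generated test case, which is the kind of transparency the abstract emphasizes.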

Paper Structure

This paper contains 49 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Synthetic data generation tool in EvalAssist.
  • Figure 2: Screenshot of the full EvalAssist interface for the bias evaluation task.
  • Figure 3: Explanation of a synthetic data instance. When a user hovers over a specific test instance, a "View explanation" button appears beneath the generated result. Clicking the button opens a pop-up window that displays the evaluation rationale and the prompt used to generate the instance. Explanation shortened for readability.