EvalAssist: A Human-Centered Tool for LLM-as-a-Judge

Zahra Ashktorab, Werner Geyer, Michael Desmond, Elizabeth M. Daly, Martin Santillan Cooper, Qian Pan, Erik Miehling, Tejaswini Pedapati, Hyo Jin Do

TL;DR

EvalAssist tackles the costly challenge of evaluating outputs from multiple LLMs by providing a human-centered, criterion-driven framework that supports direct assessment and pairwise comparison through a web UI and UNITXT-based API. It isolates generation from evaluation, enables multi-evaluator workflows, and integrates bias indicators to improve trust. A key contribution is the integration of prompt-chaining evaluation pipelines and open-source tooling (UNITXT) to export reproducible notebooks for large-scale analysis, including specialized judges for harms and risks. The work demonstrates practical impact by deploying with hundreds of users and showing preferred evaluation modes differ by task type, highlighting flexible, transparent evaluation as essential for reliable LLM-mediated judgments.

Abstract

Large language models are broadly available and can generate vast numbers of outputs from varied prompts and configurations. Determining the best output for a given task therefore requires an intensive evaluation process in which machine learning practitioners must decide how to assess the outputs and then carefully carry out the evaluation. This process is both time-consuming and costly. As practitioners work with an increasing number of models, they must evaluate outputs to determine which model and prompt perform best for a given task. LLMs are increasingly used as evaluators to filter training data, evaluate model performance, assess harms and risks, or assist human evaluators with detailed assessments. We present EvalAssist, a framework that simplifies the LLM-as-a-judge workflow. The system provides an online criteria development environment where users can interactively build, test, and share custom evaluation criteria in a structured and portable format. We support a set of LLM-based evaluation pipelines that leverage off-the-shelf LLMs and use a prompt-chaining approach we developed and contributed to the UNITXT open-source library. Our system also includes specially trained evaluators to detect harms and risks in LLM outputs. We have deployed the system internally in our organization with several hundred users.
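
To make the idea of a structured criterion and a prompt-chaining judge concrete, the following is a minimal sketch of a two-step direct assessment: the judge first writes a free-text assessment, and a second prompt then asks it to pick one of the criterion's options. This is an illustration only, not the EvalAssist or UNITXT implementation. The names `Criterion`, `direct_assessment`, and `stub_llm`, the prompt wording, and the two-step split are all assumptions; the judge is a stub standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Criterion:
    """Hypothetical portable criterion: a name, a description, and labeled options."""
    name: str
    description: str
    options: Dict[str, str]  # option label -> what that option means


def direct_assessment(
    criterion: Criterion,
    context: Dict[str, str],
    response: str,
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Two-step prompt chain: (1) free-text assessment, (2) option selection."""
    context_block = "\n".join(f"{key}: {value}" for key, value in context.items())
    option_block = "\n".join(
        f"- {label}: {desc}" for label, desc in criterion.options.items()
    )

    # Step 1: ask the judge to reason about the response against the criterion.
    assess_prompt = (
        f"{context_block}\n\nResponse to evaluate:\n{response}\n\n"
        f"Criterion: {criterion.description}\n"
        "Explain how well the response satisfies the criterion."
    )
    assessment = llm(assess_prompt)

    # Step 2: feed the assessment back and ask for exactly one option label.
    select_prompt = (
        f"Assessment:\n{assessment}\n\nOptions:\n{option_block}\n\n"
        "Answer with exactly one of the option labels."
    )
    raw = llm(select_prompt).strip()

    # Accept only answers that match a defined option (case-insensitive).
    for label in criterion.options:
        if raw.lower() == label.lower():
            return {"option": label, "explanation": assessment}
    raise ValueError(f"Judge returned an unknown option: {raw!r}")


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    if "Answer with exactly one of the option labels." in prompt:
        return "Yes"
    return "The summary covers the main point of the article."


criterion = Criterion(
    name="faithfulness",
    description="Is the summary faithful to the article?",
    options={"Yes": "No unsupported claims.", "No": "Contains unsupported claims."},
)
result = direct_assessment(
    criterion,
    context={"instruction": "Summarize the article.", "article": "Example text."},
    response="A short summary.",
    llm=stub_llm,
)
```

Because the judge is passed in as a plain callable, the same chain can be exercised with a stub while the criterion is being drafted and then pointed at a real model.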

Paper Structure

This paper contains 11 sections, 7 figures.

Figures (7)

  • Figure 1: EvalAssist landing page with test case catalog on the left and different evaluation strategies to choose from in the center.
  • Figure 2: Evaluation Criteria Form for Pairwise Comparison. Variables created in the task context can be referenced in the criteria definition.
  • Figure 3: Evaluation Criteria Form for Direct Assessment. Variables created in the task context can be referenced in the criteria definition and in the options.
  • Figure 4: Task Context for a Summarization Task. The Task Context is consistent for both direct assessment and pairwise comparison strategies. Users have the option to break down the context into variables, such as the instruction and article, to simplify and unify references when developing evaluation criteria.
  • Figure 5: Results for direct assessment. Users can select their expected judgments for the output, which are auto-populated based on the criteria they define (i.e., the scale items created when setting the criteria). The results display the AI Evaluator's judgments, indicating whether there is agreement between the user and the AI, along with explanations for each result.
  • ...and 2 more figures
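
The agreement check described for Figure 5, where the user's expected judgments are compared with the AI evaluator's judgments, can be sketched as a simple match rate. This is an assumed illustration rather than the tool's own computation; the function name `agreement_rate` and the example labels are hypothetical.

```python
from typing import Sequence


def agreement_rate(expected: Sequence[str], judged: Sequence[str]) -> float:
    """Fraction of test cases where the judge's option equals the user's expected option."""
    if len(expected) != len(judged):
        raise ValueError("expected and judged must have the same length")
    if not expected:
        raise ValueError("at least one test case is required")
    matches = sum(e == j for e, j in zip(expected, judged))
    return matches / len(expected)


# Hypothetical judgments for four test cases under a Yes/No criterion.
expected = ["Yes", "No", "Yes", "Yes"]
judged = ["Yes", "No", "No", "Yes"]
rate = agreement_rate(expected, judged)
```

A low rate on a small set of test cases is a cue to revisit the wording of the criterion or its options before running the evaluation at scale.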