EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria

Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, Juho Kim

TL;DR

EvalLM introduces an interactive system that enables prompt designers to iteratively refine LLM prompts by evaluating outputs against user-defined, application-specific criteria using an LLM-based evaluation assistant and a criteria reviewer. Formative interviews reveal that designers rely on manual, multi-faceted, and dynamic evaluation, which is costly and hard to scale. In a comparative user study, EvalLM enabled broader and deeper evaluation, faster prompt refinements, and higher satisfaction with criteria, reducing the number of revisions by 59% compared to manual evaluation. The work contributes a practical, collaborative framework for prompt design and points toward extending these evaluation methods to model evaluation and alignment in real-world tasks.

Abstract

By simply composing prompts, developers can prototype novel generative applications with Large Language Models (LLMs). To refine prototypes into products, however, developers must iteratively revise prompts by evaluating outputs to diagnose weaknesses. Formative interviews (N=8) revealed that developers invest significant effort in manually evaluating outputs as they assess context-specific and subjective criteria. We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria. By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail, and improve these based on the evaluator's feedback. A comparative study (N=12) showed that EvalLM, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions. Beyond prompts, our work can be extended to augment model evaluation and alignment in specific application contexts.

Paper Structure (61 sections, 5 equations, 10 figures, 3 tables)


Figures (10)

  • Figure 1: EvalLM is composed of three main panels: generation, data, and evaluation. In the generation panel, the user can compose the overall instructions for their task (A), two prompt templates they want to compare (B), and sample inputs from their dataset (C). To evaluate outputs, the user first defines their criteria set (D) and can see an overview of evaluation results (E). If the user has added samples to their validation set, they can also check the accuracy of the evaluations in this panel (F). The data panel shows a series of rows, where each row presents an input sample, the outputs generated on this input, and the evaluation results for these outputs.
  • Figure 2: For each prompt in EvalLM, the user can give it a unique name (A) and compose both the system prompt (D) and the user prompt (E). To test different pairs of prompts, the user can add new prompts (B) or switch to previous prompts through the browse button (C), which opens a panel listing all of the prompts they have created.
  • Figure 3: For each criterion in EvalLM, the user provides a name (A) and a description (D). Each criterion is automatically assigned a color to help with identification. If the criteria review tool identifies improvements for the criteria, these are shown as badges (B) that the user can click to see the suggested revisions (E). Clicking on these suggestions adds them to the criteria set.
  • Figure 4: Rows in the data panel show the input sample (A), the outputs generated from the pair of prompts (B), and the evaluation results on each defined criterion (C). For each criterion, the evaluation shows three circles that respectively indicate that the first prompt won, there was a tie, or the second prompt won. A question mark over a circle indicates uncertainty in the evaluation: if only one evaluation trial was run, it means there was a small score difference between the outputs; if multiple trials were run, it means at least one trial returned a different result. The user can click on an evaluation to see the explanation (D) and highlights on the portions of the output that were relevant to that evaluation (E). If the user conducted multiple evaluation trials, they can also browse through the other trials by using the carousel at the bottom (F).
  • Figure 5: The history visualization is separated into sessions, which represent sets of samples that were generated and evaluated with the same prompts and criteria. For each session, the history shows the names of the prompts (A) and criteria (B) used, and the user can click on these to see their content at the time (C). For each criterion, the history shows a bar for each sample evaluated (D), which is color-coded to represent which prompt won or if there was a tie for that sample.
  • ...and 5 more figures
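
The uncertainty rule described in the Figure 4 caption can be sketched in a few lines of Python. This is only an illustrative reading of the caption, not EvalLM's actual implementation: the `Trial` type, the `summarize` function, the numeric threshold `SMALL_DIFF`, and the majority vote used to pick the displayed winner across trials are all assumptions, since the caption does not specify them.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed threshold: the caption says "small score difference" without a value.
SMALL_DIFF = 1


@dataclass
class Trial:
    """One evaluation trial for a single criterion (hypothetical structure)."""
    score_a: int  # score given to the first prompt's output
    score_b: int  # score given to the second prompt's output


def trial_result(trial: Trial) -> str:
    """Map a trial to one of the three circles: first wins, tie, second wins."""
    if trial.score_a > trial.score_b:
        return "first"
    if trial.score_b > trial.score_a:
        return "second"
    return "tie"


def summarize(trials: list[Trial], small_diff: int = SMALL_DIFF) -> tuple[str, bool]:
    """Return (displayed result, uncertain flag) for one criterion on one sample."""
    if not trials:
        raise ValueError("at least one trial is required")

    results = [trial_result(t) for t in trials]

    if len(trials) == 1:
        # Single trial: flag uncertainty when the winner leads by a small margin.
        # Treating an exact tie as not uncertain is an assumption of this sketch.
        diff = abs(trials[0].score_a - trials[0].score_b)
        return results[0], 0 < diff <= small_diff

    # Multiple trials: flag uncertainty when the trials disagree with each other.
    counts = Counter(results)
    winner = counts.most_common(1)[0][0]  # assumed: majority vote across trials
    return winner, len(counts) > 1


print(summarize([Trial(8, 3)]))
print(summarize([Trial(5, 6)]))
print(summarize([Trial(8, 3), Trial(7, 2), Trial(4, 6)]))
```

Under these assumptions, a single trial with a clear margin is shown without a question mark, while a narrow margin or any disagreement among repeated trials is flagged for the user to inspect.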