Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

Shreya Shankar, J. D. Zamfirescu-Pereira, Björn Hartmann, Aditya G. Parameswaran, Ian Arawjo

TL;DR

EvalGen tackles the problem of validating LLM-based evaluators used to judge LLM outputs by introducing a mixed-initiative interface embedded in ChainForge. It automatically proposes evaluation criteria and candidate implementations (code or LLM prompts) and uses human grading on a sample of outputs to select the most aligned assertions, revealing phenomena like criteria drift and output-dependent criteria. Offline evaluation against SPADE and a qualitative user study show that human-in-the-loop criterion generation can achieve equal or better alignment with fewer assertions, while also exposing challenges in trust, control, and iterative refinement. The work offers design principles and empirical insights for building practical, human-guided LLM evaluation assistants in real-world LLMOps pipelines.

Abstract

Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to "validate the validators": aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment. In particular, we identify a phenomenon we dub criteria drift: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appear dependent on the specific LLM outputs observed (rather than independent criteria that can be defined a priori), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.
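
To make the selection step in the abstract concrete, here is a minimal sketch of choosing among candidate assertion implementations by how well they agree with human grades. It is an illustration under stated assumptions, not EvalGen's implementation: the function names (no_hashtags, under_20_words, agreement, select_best), the sample data, and the plain agreement rate used as the score are all hypothetical, and the paper's own alignment measure may be defined differently.

```python
from typing import Callable, Dict, List, Tuple

Assertion = Callable[[str], bool]   # returns True when the output passes
Graded = List[Tuple[str, bool]]     # (LLM output, human grade: True = good)


# Hypothetical candidate implementations for illustration only.
def no_hashtags(output: str) -> bool:
    return "#" not in output


def under_20_words(output: str) -> bool:
    return len(output.split()) < 20


def agreement(assertion: Assertion, graded: Graded) -> float:
    """Fraction of graded outputs where pass/fail matches the human grade.

    Assumption: simple agreement rate, used here as a stand-in score.
    """
    matches = sum(assertion(text) == is_good for text, is_good in graded)
    return matches / len(graded)


def select_best(candidates: Dict[str, Assertion], graded: Graded) -> str:
    """Name of the candidate that agrees most often with the human grades."""
    return max(candidates, key=lambda name: agreement(candidates[name], graded))


# A small, made-up sample of human-graded outputs.
graded: Graded = [
    ("Short and clean summary.", True),
    ("Check this out #wow #amazing", False),
    ("Another tidy answer with no tags.", True),
    ("#spam", False),
]

candidates: Dict[str, Assertion] = {
    "no_hashtags": no_hashtags,
    "under_20_words": under_20_words,
}

scores = {name: agreement(fn, graded) for name, fn in candidates.items()}
best = select_best(candidates, graded)
print(scores, best)
```

In this toy sample, the hashtag check matches every human grade while the length check passes everything and so misses both bad outputs, so the hashtag check is selected. With only a handful of graded outputs, such scores are noisy, which is consistent with the variance across grading samples noted in the Figure 4 caption.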

Paper Structure (41 sections, 3 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: EvalGen's approach to assisting users in aligning evaluations. Users iterate through the process of refining criteria and grading. Note that LLM pipeline inputs and outputs are provided by our larger system, and are outside the scope of this paper.
  • Figure 2: The workflow of our EvalGen prototype, from (a) a Prompt Node attached to an empty Multi-Eval Node, showing a Generate Criteria button; (b) the pop-up EvalGen Wizard with three options, Infer, Manual, and Grade First; (c) the Pick Criteria screen, allowing users to describe criteria in natural language and toggle Code or LLM implementations; (d) the Grade screen, with the LLM output (top), input variables (left), and prompt (right), Good and Bad grade buttons, and an "I'm Tired" button (bottom-right) to finish; and finally (e) the Report Card screen, showing the alignment of each criterion and across criteria. Hovering over the alignment shows a confusion matrix. Note that some descriptions and elements have been clipped for space.
  • Figure 3: The Table View, showing inputs, LLM outputs, and evaluation results per criterion for the NER task (see the user study section).
  • Figure 4: Alignments for assertion sets that result from different policies to sample grades from the user. Each policy was tested across 10 trials, with each involving a sample of 16 LLM outputs. Randomly sampling LLM outputs for grading introduces significant variance in alignment across the entire dataset.