Human-Centered Design Recommendations for LLM-as-a-Judge

Qian Pan, Zahra Ashktorab, Michael Desmond, Martin Santillan Cooper, James Johnson, Rahul Nair, Elizabeth Daly, Werner Geyer

TL;DR

The paper addresses the inadequacy of reference-based natural-language evaluation for diverse and creative LLM outputs by presenting EvaluLLM, a human-in-the-loop LLM-as-a-judge tool. It reports a qualitative user study with eight domain experts that identifies challenges and design needs for aligning evaluation criteria with human intent, including structured templates, interactive criterion refinement, and transparency and bias controls. The authors offer design recommendations and example features to enable efficient, transparent, and adaptable evaluation workflows that balance trust and cost. The findings suggest that a human-assisted, customizable evaluation framework can improve the reliability and applicability of LLM-based judgments in practical settings.

Abstract

Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain significant concerns. Integrating human input is crucial to ensure that the evaluation criteria are aligned with the human's intent and that evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM that enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria that align the LLM-as-a-judge with practitioners' preferences and expectations. We offer findings and design recommendations to optimize human-assisted LLM-as-a-judge systems.

Paper Structure (29 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: EvaluLLM interfaces and key features
  • Figure 2: Recommended evaluation workflow: interactive refinement of criteria with a subset of data prior to applying evaluation to entire dataset can potentially improve preference alignment and trust calibration.
  • Figure 3: Recommended design to (A) enable users to choose from a list of predefined custom metric modules and (B) enable users to create a set of evaluation criteria based on common use cases.
  • Figure 4: Recommended design to provide structured and customizable templates that support hierarchical, multi-dimensional evaluations.
  • Figure 5: Recommended design demonstrating the ability of users to leverage LLM-as-a-Judge for Criteria Iteration.
  • ...and 1 more figure