Designing Staged Evaluation Workflows for LLMs: Integrating Domain Experts, Lay Users, and Model-Generated Evaluation Criteria

Annalisa Szymanski, Simret Araya Gebreegziabher, Oghenemaro Anuyah, Ronald A. Metoyer, Toby Jia-Jun Li

TL;DR

This study investigates criteria developed by domain experts, lay users, and LLMs to identify their complementary roles within an evaluation workflow, and proposes design guidelines for a staged evaluation workflow combining the complementary strengths of these sources to balance quality, cost, and scalability.

Abstract

Large Language Models (LLMs) are increasingly utilized for domain-specific tasks, yet evaluating their outputs remains challenging. A common strategy is to apply evaluation criteria to assess alignment with domain-specific standards, yet little is understood about how criteria differ across sources or where each type is most useful in the evaluation process. This study investigates criteria developed by domain experts, lay users, and LLMs to identify their complementary roles within an evaluation workflow. Results show that experts produce fact-based criteria with long-term value, lay users emphasize usability with a shorter-term focus, and LLMs target procedural checks for immediate task requirements. We also examine how criteria evolve between a priori and a posteriori phases, noting drift across stages as well as convergence in the a posteriori phase. Based on our observations, we propose design guidelines for a staged evaluation workflow combining the complementary strengths of these sources to balance quality, cost, and scalability.

Paper Structure (35 sections, 2 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The process followed for creating and refining evaluation criteria for LLMs. In step (1), participants (domain experts or lay users) are presented with a prompt to review. In step (2), participants create initial criteria. In step (3), participants review three outputs generated by LLMs. In step (4), each participant subsequently adds or refines criteria based on review of the outputs.
  • Figure 2: Workflow for Creating and Refining Evaluation Criteria in MAXQDA. The left panel shows the a priori phase, where participants (A) review a scenario-specific prompt and (B) create and store initial evaluation criteria. The right panel shows the a posteriori phase, where participants (C) review outputs generated by the LLMs, (D) either refine or create additional criteria, and (E) tag criteria to parts of the output.
  • Figure 3: Heat map of the average number of criteria for each domain created by experts, lay users, and the LLMs at both the a priori and a posteriori phases.
  • Figure 4: Distribution of criteria counts generated in the a priori and a posteriori phases across participant types (Domain Experts, Lay Users, and LLMs) in the Nutrition and Math Pedagogy domains. Means are labeled within each distribution.