AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models

Kristen M. Edwards, Farnaz Tehranchi, Scarlett R. Miller, Faez Ahmed

TL;DR

This work tackles the problem of scalable, credible design evaluation by introducing a rigorous statistical framework for testing AI judges against human experts. It compares four vision-language judge configurations that differ in the in-context examples and reasoning they use (No Context, Text, Text+Image, Text+Image+Reasoning) and applies a suite of analyses (Kappa, ICC, MAE, Bland-Altman, Friedman/Wilcoxon, Spearman, TOST, and Jaccard top-set overlap) to test for expert equivalence across four subjective design metrics. The case study shows that a reasoning-supported, multimodal AI judge often achieves expert-level agreement for uniqueness and drawing quality and outperforms or matches trained novices across several metrics, though usefulness remains challenging to emulate. The findings suggest that expert-equivalent subjective evaluation is achievable with in-context prompting when validated against multiple metrics, which would enable scalable, cost-effective design critique and offers a template for validating AI judgments in other subjective domains.
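Two of the analyses named above, MAE and Jaccard top-set overlap, can be illustrated with a minimal sketch. The function names, the tie-breaking rule, and the ratings are hypothetical and chosen only for illustration; the paper's own definitions and data may differ.

```python
from typing import Sequence


def mean_absolute_error(judge: Sequence[float], expert: Sequence[float]) -> float:
    """Average absolute gap between a judge's ratings and the expert's ratings."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("rating lists must be non-empty and the same length")
    return sum(abs(j - e) for j, e in zip(judge, expert)) / len(judge)


def top_set(ratings: Sequence[float], k: int) -> set[int]:
    """Indices of the k highest-rated designs.

    Assumption: ties are broken in favor of the lower index.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    order = sorted(range(len(ratings)), key=lambda i: (-ratings[i], i))
    return set(order[:k])


def jaccard_top_set_overlap(
    judge: Sequence[float], expert: Sequence[float], k: int
) -> float:
    """Jaccard similarity |A & B| / |A | B| of the two raters' top-k sets."""
    a, b = top_set(judge, k), top_set(expert, k)
    return len(a & b) / len(a | b)


# Hypothetical ratings for six designs (not data from the paper).
expert_ratings = [5, 3, 4, 1, 2, 4]
judge_ratings = [4, 3, 5, 2, 1, 3]

print(mean_absolute_error(judge_ratings, expert_ratings))
print(jaccard_top_set_overlap(judge_ratings, expert_ratings, k=3))
```

MAE captures how far the judge's ratings sit from the expert's on average, while the top-set overlap asks whether the two raters would shortlist the same designs, which is a different question: a judge can have a small MAE and still pick a different top set.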

Abstract

The subjective evaluation of early-stage engineering designs, such as conceptual sketches, traditionally relies on human experts. However, expert evaluations are time-consuming, expensive, and sometimes inconsistent. Recent advances in vision-language models (VLMs) offer the potential to automate design assessments, but it is crucial to ensure that these AI "judges" perform on par with human experts, and no existing framework assesses expert equivalence. This paper introduces a rigorous statistical framework to determine whether an AI judge's ratings match those of human experts. We apply this framework in a case study evaluating four VLM-based judges on key design metrics (uniqueness, creativity, usefulness, and drawing quality). These AI judges employ various in-context learning (ICL) techniques, including uni- vs. multimodal prompts and inference-time reasoning. The same statistical framework is used to assess three trained novices for expert equivalence. Results show that the top-performing AI judge, using text- and image-based ICL with reasoning, achieves expert-level agreement for uniqueness and drawing quality and outperforms or matches trained novices across all metrics. In 6/6 runs for both uniqueness and creativity, and 5/6 runs for both drawing quality and usefulness, its agreement with experts meets or exceeds that of the majority of trained novices. These findings suggest that reasoning-supported VLMs can achieve human-expert equivalence in design evaluation. This has implications for scaling design evaluation in education and practice, and provides a general statistical framework for validating AI judges in other domains requiring subjective content evaluation.

Paper Structure

This paper contains 28 sections, 1 equation, 6 figures, 16 tables.

Figures (6)

  • Figure 1: The in-context learning (ICL) workflow utilized to develop our four AI judges.
  • Figure 2: AI Judge: Text + Image + Reasoning reaches expert-level AUC, as does one trained novice; all other models do not.
  • Figure 3: Bland-Altman plots comparing AI judges' ratings to expert ratings. (a) The AI Judge with No Context consistently assigns higher uniqueness ratings than Expert 2. (b) The AI Judge with Reasoning shows minimal bias. Both instances reveal larger differences in the middle range of the mean ratings.
  • Figure 4: Two of the AI judges, shown in blue, reach the same AUC as Expert 1 vs. Expert 2, while the trained novices do not.
  • Figure 5: AI Judge: Text + Image + Reasoning and Trained Novice 2 come closest to the expert-level AUC of 0.67, each with an AUC of 0.62. In general, for Usefulness, expert-level Jaccard similarity is hard for novices and AI judges to meet.
  • ...and 1 more figures
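
The Bland-Altman comparison described in the Figure 3 caption can be sketched as follows. This is a minimal illustration of the standard bias and 95% limits-of-agreement computation; the function name and the ratings are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev
from typing import Sequence


def bland_altman(
    judge: Sequence[float], expert: Sequence[float]
) -> tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for judge-minus-expert differences.

    bias   : mean difference; a positive value means the judge rates higher.
    limits : bias +/- 1.96 sample standard deviations of the differences,
             the conventional 95% limits of agreement.
    """
    if len(judge) != len(expert) or len(judge) < 2:
        raise ValueError("need two equally long lists with at least two ratings")
    diffs = [j - e for j, e in zip(judge, expert)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Hypothetical ratings in which the judge tends to rate higher than the expert.
expert_ratings = [2, 3, 4, 3, 5]
judge_ratings = [3, 4, 4, 4, 6]

bias, lower, upper = bland_altman(judge_ratings, expert_ratings)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

A bias well above zero corresponds to the pattern the caption reports for the No Context judge, which consistently assigns higher ratings than the expert, while a bias near zero corresponds to the minimal bias reported for the judge with reasoning.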