MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation

Nalin Srun, Parisa Rastin, Guénaël Cabanes, Lydia Boudjeloud Assala

TL;DR

MILE-RefHumEval tackles the challenge of evaluating LLMs without ground-truth references by deploying an ensemble of independently prompted evaluators guided by a human-aligned schema. It supports both discrete and continuous judgments across modalities (text and vision) and computes final scores via majority voting or averaging, which mitigates interaction and consensus biases. Across benchmarks such as FairEval, SummEval, OID, PandaLM, and Topical-Chat, the framework achieves strong alignment with human judgments and lower query overhead than baselines like CHATEVAL, demonstrating the value of evaluator diversity and non-conversational assessment. The work highlights practical implications for scalable, robust, and interpretable LLM evaluation, and notes limitations around domain generalization, prompt sensitivity, and the optimal number of evaluators, with future directions including broader modalities and domain-specific evaluators. Improvements in $MSE$, $RMSE$, and $MAE$ indicate tighter agreement with human annotations, while $Acc$, $F1$, $MCC$, and $Kap$ indicate robust discrete evaluation.
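
The aggregation step described above (majority voting for discrete judgments, averaging for continuous ones) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `aggregate_discrete` and `aggregate_continuous` are hypothetical, and returning `None` on a tied vote is an assumption, since the summary does not say how ties are resolved.

```python
from collections import Counter
from statistics import mean


def aggregate_discrete(votes):
    """Majority vote over labels produced by independent evaluators.

    Returns None when the top two labels are tied (tie handling is an
    assumption; the source does not specify it).
    """
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def aggregate_continuous(scores):
    """Arithmetic mean of the scores produced by independent evaluators."""
    return mean(scores)


# Three evaluators judge the same item without seeing each other's output.
winner = aggregate_discrete(["A", "B", "A"])
final_score = aggregate_continuous([3.0, 4.0, 5.0])
```

Because each evaluator is queried once and independently, the number of model calls per item equals the number of evaluators, with no additional discussion rounds.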

Abstract

We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgments. With task-specific prompts spanning best-candidate selection, summarization, image captioning, and dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
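
As a rough illustration of how alignment with human judgments might be quantified, the sketch below computes standard error metrics for continuous scores and plain accuracy for discrete labels. The function names and the toy inputs are hypothetical; the definitions are the textbook ones and are not taken from the paper's evaluation code.

```python
import math


def mse(pred, human):
    """Mean squared error between framework scores and human scores."""
    return sum((p - h) ** 2 for p, h in zip(pred, human)) / len(human)


def rmse(pred, human):
    """Root mean squared error (square root of the MSE)."""
    return math.sqrt(mse(pred, human))


def mae(pred, human):
    """Mean absolute error between framework scores and human scores."""
    return sum(abs(p - h) for p, h in zip(pred, human)) / len(human)


def accuracy(pred, human):
    """Fraction of items where the framework label matches the human label."""
    return sum(p == h for p, h in zip(pred, human)) / len(human)


# Toy data (made up for illustration only).
framework_scores = [1.0, 2.0, 3.0]
human_scores = [1.0, 2.0, 5.0]
print(mse(framework_scores, human_scores))
print(mae(framework_scores, human_scores))
print(accuracy(["A", "B"], ["A", "A"]))
```

Lower error values and higher accuracy both indicate closer agreement with the human annotations.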

Paper Structure (15 sections, 17 figures, 6 tables)

Figures (17)

  • Figure 1: Overall Framework of MILE-RefHumEval.
  • Figure 2: Evaluation prompt used in MILE-RefHumEval for pairwise comparison of two candidate answers (PandaLM Benchmark).
  • Figure 3: Evaluation prompt used in MILE-RefHumEval for the Image Captioning benchmark to assess LLM-generated captions (OID Rated Image Caption Benchmark).
  • Figure 4: Evaluation prompt used in MILE-RefHumEval for the Summarization benchmark to assess LLM-generated summaries (SummEval Benchmark).
  • Figure 5: Evaluation prompt used in MILE-RefHumEval for the Topical Chat benchmark to assess LLM-generated conversational responses (Topical Chat Benchmark).
  • ...and 12 more figures