MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation
Nalin Srun, Parisa Rastin, Guénaël Cabanes, Lydia Boudjeloud Assala
TL;DR
MILE-RefHumEval tackles the challenge of evaluating LLMs without ground-truth references by deploying an ensemble of independently prompted evaluators guided by a human-aligned schema. It supports both discrete and continuous judgments across modalities (text and vision) and computes final scores via majority voting or averaging, mitigating interaction and consensus biases. Across benchmarks such as FairEval, SummEval, OID, PandaLM, and Topical-Chat, the framework achieves strong alignment with human judgments and lower query overhead than baselines like CHATEVAL, demonstrating the value of evaluator diversity and non-conversational assessment. The work highlights practical implications for scalable, robust, and interpretable LLM evaluation. It also outlines limitations around domain generalization, prompt sensitivity, and the optimal number of evaluators, and points to future work on broader modalities and domain-specific evaluators. Improvements in $MSE$, $RMSE$, and $MAE$ indicate tighter agreement with human annotations on continuous scores, while $Acc$, $F1$, $MCC$, and $Kap$ indicate robust discrete evaluation.
Abstract
We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgments. With task-specific prompts for tasks ranging from best-candidate selection, summarization, and image captioning to dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show that it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
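As a rough illustration of the aggregation step, the sketch below combines the outputs of independently prompted evaluators by majority vote (discrete judgments) or by averaging (continuous scores). It is a minimal sketch and not the authors' implementation: the function names `aggregate_discrete` and `aggregate_continuous` are hypothetical, the evaluator outputs are assumed to be already collected as plain lists, and the tie-breaking rule (first label encountered wins) is an assumption that the text does not specify.

```python
from collections import Counter
from statistics import mean


def aggregate_discrete(labels):
    """Majority vote over labels returned by independent evaluators.

    Assumption: ties are broken in favour of the label that appears
    first in `labels`; the source does not state a tie-breaking rule.
    """
    if not labels:
        raise ValueError("at least one evaluator label is required")
    counts = Counter(labels)
    top = max(counts.values())
    # Scan in original order so the tie-break is deterministic.
    for label in labels:
        if counts[label] == top:
            return label


def aggregate_continuous(scores):
    """Arithmetic mean of scores returned by independent evaluators."""
    if not scores:
        raise ValueError("at least one evaluator score is required")
    return mean(scores)


# Hypothetical outputs from three evaluators that never see each other's answers.
winner = aggregate_discrete(["Assistant 1", "Assistant 2", "Assistant 1"])
score = aggregate_continuous([4.0, 5.0, 3.0])
```

Because each evaluator is queried once and its output is combined offline, no discussion rounds between evaluators are needed, which is consistent with the reduced query overhead described above.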
