Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games
Andrés Isaza-Giraldo, Paulo Bala, Lucas Pereira
TL;DR
This work tackles the challenge of evaluating open-ended player responses in serious games by testing five small, local LLMs as evaluators in En-join. Using traditional binary metrics (TPR, TNR, PPV, NPV, F1) across three levels, the study reveals substantial variation in accuracy and reliability among models, with Phi-4 and Qwen delivering the strongest overall performance and others showing trade-offs between sensitivity and specificity. The results highlight context-dependent strengths and weaknesses, and suggest ensemble approaches (e.g., combining Qwen and Phi-4) to balance positive and negative predictions. The findings underscore the need for task-specific, human-centered evaluation frameworks when deploying LLM-based evaluators in energy-education serious games, and point to practical guidance for model selection and prompting strategies in real-world deployments.
Abstract
The evaluation of open-ended responses in serious games presents a unique challenge, as correctness is often subjective. Large Language Models (LLMs) are increasingly being explored as evaluators in such contexts, yet their accuracy and consistency remain uncertain, particularly for smaller models intended for local execution. This study investigates the reliability of five small-scale LLMs when assessing player responses in En-join, a game that simulates decision-making within energy communities. Using traditional binary classification metrics (including accuracy, true positive rate, and true negative rate), we systematically compare these models across different evaluation scenarios. Our results show the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance. While some models excel at identifying correct responses, others struggle with false positives or inconsistent evaluations. The findings highlight the need for context-aware evaluation frameworks and careful model selection when deploying LLMs as evaluators. This work contributes to the broader discourse on the trustworthiness of AI-driven assessment tools, offering insights into how different LLM architectures handle subjective evaluation tasks.
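As an illustration of the binary classification metrics named above, the following minimal sketch computes accuracy, TPR, TNR, PPV, NPV, and F1 from paired human and model judgments. The function name `binary_metrics`, the boolean encoding (`True` meaning "response accepted as correct"), and the sample labels are assumptions made for this example; they are not taken from the study's code or data.

```python
import math


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def binary_metrics(human: list[bool], model: list[bool]) -> dict[str, float]:
    """Compare model verdicts against human reference labels.

    Hypothetical helper: True means the response was judged correct.
    """
    if len(human) != len(model):
        raise ValueError("human and model label lists must have equal length")

    # Count the four confusion-matrix cells.
    tp = sum(h and m for h, m in zip(human, model))
    tn = sum((not h) and (not m) for h, m in zip(human, model))
    fp = sum((not h) and m for h, m in zip(human, model))
    fn = sum(h and (not m) for h, m in zip(human, model))

    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "TPR": _ratio(tp, tp + fn),  # sensitivity / recall
        "TNR": _ratio(tn, tn + fp),  # specificity
        "PPV": _ratio(tp, tp + fp),  # precision
        "NPV": _ratio(tn, tn + fn),
        "F1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


# Made-up labels for illustration only (not data from the study).
human_labels = [True, True, True, False, False, False, True, False]
model_labels = [True, True, False, False, True, False, True, False]
metrics = binary_metrics(human_labels, model_labels)
```

A model with high TPR but low TNR accepts most correct responses while also letting incorrect ones through, which is the sensitivity versus specificity trade-off the abstract refers to. Returning NaN for undefined ratios is a design choice of this sketch, so that a level with no negative examples is not silently scored as zero.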
