SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity
Ansar Aynetdinov, Alan Akbik
TL;DR
This work addresses scalable, human-aligned evaluation of instruction-tuned LLMs by introducing SemScore, a metric based on semantic textual similarity (STS) that compares each model output to a gold response using cosine similarity over embeddings from the all-mpnet-base-v2 model. The authors benchmark SemScore against eight baseline metrics across 12 instruction-tuned LLMs and 252 instructions, measuring correlation with human judgments via Kendall $\tau$ and Pearson $r$; SemScore achieves the strongest alignment. The study shows that a simple, low-cost evaluation approach can work without relying on proprietary LLM evaluators, while acknowledging limitations related to the choice of STS model and the size of the dataset. Overall, SemScore offers a reproducible, scalable way to automate the evaluation of instruction-tuned LLMs, which is useful when comparing models and model variants during development.
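The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `cosine_similarity`, `semscore`, and `mpnet_embed` are hypothetical, and averaging the per-pair similarities into one model-level score is an assumption of this sketch. The embedding function is passed in as a parameter so the core logic runs without any model download; `mpnet_embed` shows one way the embeddings might be obtained, assuming the `sentence-transformers` package is installed.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def semscore(
    outputs: Sequence[str],
    golds: Sequence[str],
    embed: Callable[[Sequence[str]], Sequence[Vector]],
) -> float:
    """Mean cosine similarity between embedded outputs and gold responses.

    Averaging over pairs is an assumption of this sketch.
    """
    if not outputs or len(outputs) != len(golds):
        raise ValueError("need equally many (and at least one) outputs and golds")
    output_vecs = embed(outputs)
    gold_vecs = embed(golds)
    sims = [cosine_similarity(o, g) for o, g in zip(output_vecs, gold_vecs)]
    return sum(sims) / len(sims)


def mpnet_embed(texts: Sequence[str]) -> Sequence[Vector]:
    """Hypothetical helper: embed texts with all-mpnet-base-v2.

    Imported lazily so the rest of the sketch runs without the package.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    return model.encode(list(texts))
```

With the package installed, a call such as `semscore(model_outputs, gold_responses, mpnet_embed)` would return a single score for one model, where both list names are placeholders for the reader's own data.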
Abstract
Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable advancements in their ability to generate fitting responses to natural language instructions. However, many current works rely on manual evaluation to judge the quality of generated responses. Since such manual evaluation is time-consuming, it does not easily scale to the evaluation of multiple models and model variants. In this short paper, we propose a straightforward but remarkably effective evaluation metric called SemScore, in which we directly compare model outputs to gold target responses using semantic textual similarity (STS). We conduct a comparative evaluation of the model outputs of 12 prominent instruction-tuned LLMs using 8 widely used evaluation metrics for text generation. We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation with human evaluation. These findings indicate the utility of our proposed metric for the evaluation of instruction-tuned LLMs.
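Correlation with human evaluation can be made concrete with a small sketch. The code below computes Pearson $r$ and Kendall $\tau$ between two parallel lists of scores, for example one automated metric value and one human rating per evaluated item. It is an illustration under stated assumptions: the function names are hypothetical, the tie-corrected $\tau_b$ variant is chosen here for convenience (the text above does not say which variant was used), and the level at which scores are paired is left to the reader.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-b (tie-corrected), computed by comparing all pairs, O(n^2)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    total_pairs = n * (n - 1) // 2
    denom = math.sqrt((total_pairs - tied_x) * (total_pairs - tied_y))
    if denom == 0.0:
        raise ValueError("tau-b is undefined when one sequence is constant")
    return (concordant - discordant) / denom
```

Pearson $r$ measures linear agreement between the raw scores, while Kendall $\tau$ depends only on how the two lists rank the items, so a metric can rank items like humans do and still score lower on $r$ if its scale is nonlinear.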
