SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Ansar Aynetdinov; Alan Akbik

TL;DR

This work tackles the challenge of scalable, human-aligned evaluation for instruction-tuned LLMs by introducing SemScore, a metric based on semantic textual similarity (STS) that compares model outputs to gold responses using cosine similarity over embeddings from the all-mpnet-base-v2 model. The authors benchmark SemScore against eight baseline metrics across 12 instruction-tuned LLMs and 252 instructions, reporting correlations with human judgments via Kendall $\tau$ and Pearson $r$; SemScore achieves the strongest alignment. The study highlights the practicality of a simple, cost-effective evaluation approach that does not rely on proprietary LLM evaluators, while acknowledging limitations related to the STS model choice and dataset size. Overall, SemScore offers a reproducible, scalable way to automate the evaluation of instruction-tuned LLMs, which is useful for model comparison and development workflows.
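The core computation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `semscore`, `cosine_similarity`, and `toy_embed` are hypothetical names, averaging the per-pair similarities into one model-level score is an assumption, and `toy_embed` is a letter-count stand-in so that the sketch runs without downloading a model. In practice the `embed` argument would wrap an STS encoder such as all-mpnet-base-v2.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vector -> 0
    return dot / (norm_a * norm_b)


def semscore(
    outputs: Sequence[str],
    references: Sequence[str],
    embed: Callable[[str], Vector],
) -> float:
    """Mean cosine similarity between model outputs and gold responses.

    Assumption: per-pair similarities are averaged into one score.
    """
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have equal length")
    if not outputs:
        raise ValueError("at least one output/reference pair is required")
    sims = [
        cosine_similarity(embed(out), embed(ref))
        for out, ref in zip(outputs, references)
    ]
    return sum(sims) / len(sims)


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in encoder: 26-dimensional letter counts.

    A real setup would return sentence embeddings from an STS model.
    """
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


if __name__ == "__main__":
    outs = ["Paris is the capital of France."]
    golds = ["The capital of France is Paris."]
    print(semscore(outs, golds, toy_embed))
```

Passing the encoder in as a function keeps the scoring logic independent of any particular embedding library, so swapping the STS model only changes the `embed` argument.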

Abstract

Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable advancements in their ability to generate fitting responses to natural language instructions. However, many current works rely on manual evaluation to judge the quality of generated responses. Since such manual evaluation is time-consuming, it does not easily scale to the evaluation of multiple models and model variants. In this short paper, we propose a straightforward but remarkably effective evaluation metric called SemScore, in which we directly compare model outputs to gold target responses using semantic textual similarity (STS). We conduct a comparative evaluation of the model outputs of 12 prominent instruction-tuned LLMs using 8 widely-used evaluation metrics for text generation. We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation to human evaluation. These findings indicate the utility of our proposed metric for the evaluation of instruction-tuned LLMs.
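To make "correlation to human evaluation" concrete, the sketch below computes Kendall $\tau$ and Pearson $r$ between per-model human scores and per-model metric scores. It is an illustration under stated assumptions: the function names are hypothetical, the Kendall variant shown is the simple tau-a (tied pairs count as neither concordant nor discordant), which may differ from the variant used in the paper, and the scores in the demo are made up rather than taken from the paper's results.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 is a tie and contributes to neither count
    return (concordant - discordant) / (n * (n - 1) / 2)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up scores for four hypothetical models (not from the paper).
    human_scores = [0.9, 0.7, 0.4, 0.2]
    metric_scores = [0.80, 0.75, 0.50, 0.55]
    print(kendall_tau(human_scores, metric_scores))
    print(pearson_r(human_scores, metric_scores))
```

Kendall $\tau$ only looks at whether each pair of models is ordered the same way by humans and by the metric, while Pearson $r$ also reflects how linearly the score magnitudes track each other, which is why reporting both gives a fuller picture.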

Paper Structure (14 sections, 1 figure, 6 tables)

Introduction
Human-Judged Ranking of LLMs
Comparison of Evaluation Metrics
Baseline Metrics
Proposed Metric
Comparing Rankings
Results and Discussion
Related Work
Conclusion
Appendix
Per-task correlations
Examples
G-Eval Prompt
Human annotators

Figures (1)

Figure 1: Human evaluation of prominent LLMs, based on our study and the results of Wang et al. (2023). From this, we derive a human-judged ranking of LLMs as a basis for comparing automated evaluation metrics.
