PerQ: Efficient Evaluation of Multilingual Text Personalization Quality
Dominik Macko, Andrew Pulver
TL;DR
The paper tackles the high cost and bias of evaluating text personalization quality with multiple large language models, especially in multilingual settings. It proposes PerQ, a reference-free, resource-efficient metric trained as a multiclass classifier on majority-voted scores from three diverse LLMs, using a large multilingual training corpus and balanced data splits. PerQ achieves accuracy well above chance and correlates strongly with LLM-based meta-evaluation, while dramatically reducing inference time and memory usage. The approach enables rapid, scalable assessment of personalization quality across languages and platforms, with potential for extension to other text-evaluation facets and scales.
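The majority-voting step that produces the classifier's training labels can be illustrated with a minimal sketch. The 1-to-5 score scale, the function name `majority_vote`, and the median fallback for three-way disagreement are assumptions made for illustration; the summary above does not specify how ties are resolved.

```python
from collections import Counter
from statistics import median_low


def majority_vote(scores):
    """Return the score most judges agree on.

    Assumption: when no score has a strict majority (e.g. all three
    judges disagree), fall back to the lower median of the scores.
    """
    counts = Counter(scores).most_common()
    top_score, top_count = counts[0]
    # A tie for first place means there is no majority.
    if len(counts) > 1 and counts[1][1] == top_count:
        return median_low(scores)
    return top_score


# Hypothetical scores from three LLM judges for three texts (1-5 scale assumed).
judge_scores = [
    (4, 4, 5),  # two judges agree
    (2, 3, 4),  # all disagree, median fallback
    (1, 1, 1),  # unanimous
]
labels = [majority_vote(s) for s in judge_scores]
```

Labels produced this way would then serve as class targets for a multiclass classifier, so that a single small model stands in for the three LLM judges at inference time.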
Abstract
Since no metrics are available to evaluate specific aspects of a text, such as its personalization quality, researchers often rely solely on large language models to meta-evaluate such texts. Because individual language models have internal biases, it is recommended to combine several of them for evaluation, which directly increases the cost of such meta-evaluation. This paper introduces PerQ, a computationally efficient method for evaluating the personalization quality of a given text (generated by a language model). A case study comparing the generation capabilities of large and small language models demonstrates the usability of the proposed metric in research, effectively reducing the waste of resources.
