PerQ: Efficient Evaluation of Multilingual Text Personalization Quality

Dominik Macko, Andrew Pulver

TL;DR

The paper tackles the high cost and bias inherent in using multiple large language models to evaluate text personalization quality, especially in multilingual contexts. It proposes PerQ, a reference-free, resource-efficient metric trained as a multiclass classifier on majority-voted scores from three diverse LLMs, with a large, multilingual training corpus and balanced data splits. The results show PerQ achieves high accuracy (well above chance) and strong correlation to meta-evaluations, while dramatically reducing inference time and memory usage compared to LLM-based meta-evaluation. The approach enables rapid, scalable assessment of personalization quality across languages and platforms, with potential for extending to other text-evaluation facets and scales.
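The majority-voting step mentioned above can be illustrated with a minimal sketch. It assumes each text receives one integer score from each of three LLM judges and that the training label is the score most judges agree on. The function name `majority_label`, the example scores, and the tie-break rule (low median when all three judges disagree) are hypothetical and are not taken from the paper.

```python
from collections import Counter
from statistics import median_low


def majority_label(scores):
    """Return the most frequent score; fall back to the low median on a tie.

    The tie-break is an assumption made for this sketch only.
    """
    counts = Counter(scores)
    top, top_count = counts.most_common(1)[0]
    # A unique winner exists when no other score shares the top count.
    if sum(1 for c in counts.values() if c == top_count) == 1:
        return top
    return median_low(scores)


# Hypothetical scores from three judges for three texts.
judge_scores = [
    (4, 4, 5),  # two judges agree
    (2, 3, 2),  # two judges agree
    (1, 3, 5),  # all judges disagree, so the tie-break applies
]
labels = [majority_label(s) for s in judge_scores]
print(labels)
```

Labels produced this way would then serve as targets for a multiclass classifier, so the three judges are only needed when building the training data, not at inference time.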

Abstract

Since no metrics are available to evaluate specific aspects of a text, such as its personalization quality, researchers often rely solely on large language models to meta-evaluate such texts. Because individual language models have internal biases, it is recommended to combine the evaluations of several models, which directly increases the cost of such meta-evaluation. This paper introduces PerQ, a computationally efficient method for evaluating the personalization quality of a given text (generated by a language model). A case study comparing the generation capabilities of large and small language models shows the usability of the proposed metric in research, effectively reducing the waste of resources.

Paper Structure

This paper contains 10 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Framework of the proposed metric training and inference.
  • Figure 2: Comparison of LLMs' personalization capabilities based on the majority meta-evaluation scores (top) and the Gemma-based PerQ metric (bottom) for quality of personalization in the test-split texts.
  • Figure 3: Comparison of personalization types based on the majority meta-evaluation scores (top) and the Gemma-based PerQ metric (bottom) for quality of personalization in the test-split texts.
  • Figure 4: Per-language comparison of personalization capabilities based on the majority meta-evaluation scores (top) and the Gemma-based PerQ metric (bottom) for quality of personalization in the test-split texts.
  • Figure 5: Per-platform (i.e., per-target) comparison of personalization capabilities based on the majority meta-evaluation scores (top) and the Gemma-based PerQ metric (bottom) for quality of personalization in the test-split texts.
  • ...and 4 more figures