MTQ-Eval: Multilingual Text Quality Evaluation for Language Models

Rhitabrat Pokharel, Ameeta Agrawal

TL;DR

MTQ-Eval is a multilingual text quality evaluation framework that learns to distinguish high-quality from low-quality text across 115 languages using synthetic data and Direct Preference Optimization (DPO). By automatically generating text quality preference data and aligning base LLMs with a quality-oriented objective, MTQ-Eval improves both intrinsic text-quality assessment and downstream tasks such as sentiment analysis and summarization. The approach yields measurable gains on the MELA and Belebele datasets, with the strongest improvements for high-resource languages and additional gains in low-resource settings for certain languages and scripts. The results suggest that language-agnostic quality evaluation can generalize beyond task-specific metrics and support broader multilingual NLP applications.

Abstract

The use of large language models (LLMs) for evaluating outputs is becoming an increasingly effective and scalable approach. However, it remains uncertain whether this capability extends beyond task-specific evaluations to more general assessments of text quality, particularly in multilingual contexts. In this study, we introduce MTQ-Eval, a novel framework for multilingual text quality evaluation that learns from examples of both high- and low-quality texts and adjusts its internal representations accordingly. To develop MTQ-Eval, we first automatically generate text quality preference data and then use it to train open-source base LLMs to align with ratings of high- and low-quality text. Our comprehensive evaluation across 115 languages demonstrates the improved performance of the proposed model. Upon further analysis, we find that this enhanced evaluation capability also leads to notable improvements in downstream tasks.
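
To make the training objective concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, where the "chosen" text is the high-quality sample and the "rejected" text is the low-quality one. It is an illustration under stated assumptions, not the authors' implementation: the function name `dpo_loss`, the argument names, and the default `beta=0.1` are hypothetical, and the inputs are assumed to be summed token log-probabilities produced by the policy model and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All names and the default beta are illustrative assumptions;
    the source does not specify its hyperparameters here.
    """
    # How much more (or less) likely each text is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled difference of the two margins.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and reference models assign identical log-probabilities, the loss equals log 2; it falls below that value as the policy raises the likelihood of the high-quality text relative to the low-quality text.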

Paper Structure

This paper contains 32 sections, 1 equation, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Overview of the MTQ-Eval (1) dataset creation, (2) model training, and (3) evaluation.
  • Figure 2: Prompt used to obtain text quality ratings during DPO.
  • Figure 3: The training dataset format for DPO training.
  • Figure 4: F1 scores of supported languages.
  • Figure 5: An example of the prompt part of the DPO fine-tuning dataset.
  • ...and 5 more figures