Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification

Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko

TL;DR

This paper presents the first comprehensive multilingual benchmark of evaluation metrics for text detoxification across nine languages and finds that the proposed metrics correlate significantly more strongly with human judgments than baseline approaches.

Abstract

Despite notable advances in large language models (LLMs), reliable evaluation of text generation tasks such as text style transfer (TST) remains an open challenge. Existing research has shown that automatic metrics often correlate poorly with human judgments (Dementieva et al., 2024; Pauli et al., 2025), limiting our ability to assess model performance accurately. Furthermore, most prior work has focused primarily on English, while the evaluation of multilingual TST systems, particularly for text detoxification, remains largely underexplored. In this paper, we present the first comprehensive multilingual benchmarking study of evaluation metrics for text detoxification across nine languages: Arabic, Amharic, Chinese, English, German, Hindi, Russian, Spanish, and Ukrainian. Drawing inspiration from machine translation evaluation, we compare neural automatic metrics with LLM-as-a-judge approaches, and we also run experiments with task-specific fine-tuned models. Our analysis reveals that the proposed metrics achieve significantly higher correlation with human judgments than baseline approaches. We also provide actionable insights and practical guidelines for building robust and reliable multilingual evaluation pipelines for text detoxification and related TST tasks.
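
The central quantity in the abstract is the correlation between an automatic metric and human judgments. The sketch below shows one common way to compute such a correlation. It assumes Spearman rank correlation, which the text above does not name, and all scores and the names `human`, `metric_a`, and `metric_b` are hypothetical illustrations rather than data or metrics from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample scores (assumption: not taken from the paper).
human = [1.0, 0.0, 0.5, 1.0, 0.0]     # human annotation
metric_a = [0.9, 0.2, 0.6, 0.8, 0.1]  # a metric that tracks the annotators
metric_b = [0.4, 0.5, 0.3, 0.6, 0.5]  # a metric that does not

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(name, round(spearman(human, scores), 3))
```

Under this setup, a metric is preferred when its correlation with the human scores is higher. In a multilingual benchmark the same computation would be repeated per language and per evaluation aspect.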

Paper Structure

This paper contains 35 sections, 2 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: TextDetoxEval: Correlation of fluency measurement approaches with human-annotated fluency scores.
  • Figure 2: TextDetoxEval: Correlation of content similarity measurement approaches with human-annotated content preservation scores.
  • Figure 3: TextDetoxEval: Correlation of toxicity measurement approaches with target pairwise toxicity scores from human annotation.
  • Figure 4: TextDetoxEval: Correlation of final scores with target joined scores from human annotation.
  • Figure 5: TextDetoxEval: Comparison between XCOMET-LITE and different LLMs on the fluency scores from human annotation.
  • ...and 3 more figures