CEAID: Benchmark of Multilingual Machine-Generated Text Detection Methods for Central European Languages

Dominik Macko, Jakub Kopal

TL;DR

This study introduces CEAID, a benchmark for detecting machine-generated text in Central European languages across multiple domains (News and Social) and generators (eight LLMs). It systematically compares statistical, pretrained, and finetuned detectors, and analyzes how training-language composition affects cross-lingual generalization. The findings show finetuned detectors deliver the best overall performance and adversarial robustness, while cross-lingual transfer benefits from diverse, multi-language training data. The work underscores the importance of language-specific resources and rigorous cross-lingual evaluation to strengthen content authenticity in Central Europe.

Abstract

Research on machine-generated text detection, an important task, focuses predominantly on English. As a result, existing detectors are almost unusable for non-English languages, where they rely purely on cross-lingual transferability. Only a few works focus on any of the Central European languages, leaving transferability to these languages largely unexplored. We fill this gap by providing the first benchmark of detection methods focused on this region, along with a comparison of training-language combinations to identify the best-performing ones. We focus on multi-domain, multi-generator, and multilingual evaluation, pinpointing the differences across these individual aspects as well as the adversarial robustness of the detection methods. Supervised finetuned detectors in the Central European languages are found to be the most performant in these languages as well as the most resistant to obfuscation.

Paper Structure

This paper contains 17 sections, 3 figures, 12 tables.

Figures (3)

  • Figure 1: Central European region as defined by Bideleux and Jeffries (2007).
  • Figure 2: Detected topics in the selected dataset.
  • Figure 3: Detected genres in the selected dataset.