CEAID: Benchmark of Multilingual Machine-Generated Text Detection Methods for Central European Languages
Dominik Macko, Jakub Kopal
TL;DR
This study introduces CEAID, a benchmark for detecting machine-generated text in Central European languages across multiple domains (News and Social) and generators (eight LLMs). It systematically compares statistical, pretrained, and finetuned detectors, and analyzes how training-language composition affects cross-lingual generalization. The findings show that finetuned detectors deliver the best overall performance and adversarial robustness, while cross-lingual transfer benefits from diverse, multi-language training data. The work underscores the importance of language-specific resources and rigorous cross-lingual evaluation to strengthen content authenticity in Central Europe.
Abstract
Research on machine-generated text detection, an important task, focuses predominantly on English. As a result, existing detectors are almost unusable for non-English languages, where they rely purely on cross-lingual transferability. Only a few works focus on any of the Central European languages, leaving transferability to these languages largely unexplored. We fill this gap by providing the first benchmark of detection methods focused on this region, together with a comparison of training-language combinations to identify the best-performing ones. We focus on multi-domain, multi-generator, and multilingual evaluation, pinpointing the differences across these individual aspects as well as the adversarial robustness of the detection methods. Supervised detectors finetuned on the Central European languages are found to be the most performant in these languages as well as the most resistant to obfuscation.
