M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Toru Sasaki, Thomas Arnold, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov

TL;DR

The paper introduces M4, a large-scale benchmark for black-box machine-generated text detection that spans multiple generators, domains, and languages. It systematically evaluates seven detectors across diverse settings to reveal generalization gaps when detectors face unseen domains, generators, or languages. Through prompt diversity, minimal cleaning, and careful quality control, M4 provides both parallel and non-parallel data and robust train/dev/test splits, enabling thorough cross-domain, cross-generator, multilingual, and temporal analyses. The findings highlight strong in-domain performance yet limited cross-domain robustness, underscoring the need for more generalizable detection methods and ongoing dataset expansion to keep pace with evolving LLMs.

Abstract

Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries. However, this has also raised concerns about the potential misuse of such texts in journalism, education, and academia. In this study, we strive to create automated systems that can detect machine-generated texts and pinpoint potential misuse. We first introduce a large-scale benchmark M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Through an extensive empirical study of this dataset, we show that it is challenging for detectors to generalize well on instances from unseen domains or LLMs. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is a lot of room for improvement. We believe that our dataset will enable future research towards more robust approaches to this pressing societal problem. The dataset is available at https://github.com/mbzuai-nlp/M4.

Paper Structure

This paper contains 64 sections, 4 figures, and 17 tables.

Figures (4)

  • Figure 1: Accuracy of cross-domain experiments: given generations from ChatGPT (top) or davinci (bottom), train on a single domain and test on every domain, for five detectors. (See the tables labeled tab:chatgpt and tab:davinci for details.)
  • Figure 2: Accuracy of cross-generator experiments: train and test on arXiv (top) and Wikipedia (bottom) for five detectors, each distinguishing a single machine-text generator from human text. (See the tables labeled tab:arxiv-1 and tab:wikipedia-1 for details.)
  • Figure 3: Impact of text length on detection accuracy for arXiv and Reddit texts generated by ChatGPT, davinci, and Cohere.
  • Figure 4: Visualization of the features extracted by LIME for the Reddit domain with ChatGPT as the generator.
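
The cross-domain protocol behind Figure 1 (train a detector on one domain, then measure accuracy on every domain) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the `(text, label)` data layout, and the toy word-count detector are hypothetical and are not among the detectors evaluated in the paper. The toy detector only stands in for a real classifier so that the evaluation loop is runnable.

```python
from typing import Callable, Dict, List, Tuple

# Assumed data layout: (text, label), where 1 = machine-generated, 0 = human-written.
Example = Tuple[str, int]
Predictor = Callable[[str], int]


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def cross_domain_accuracy(
    train_fn: Callable[[List[Example]], Predictor],
    train_sets: Dict[str, List[Example]],
    test_sets: Dict[str, List[Example]],
) -> Dict[str, Dict[str, float]]:
    """Train on each single domain and test on every domain.

    Returns results[train_domain][test_domain] = accuracy.
    """
    results: Dict[str, Dict[str, float]] = {}
    for train_domain, train_examples in train_sets.items():
        predict = train_fn(train_examples)
        results[train_domain] = {
            test_domain: accuracy(predict, test_examples)
            for test_domain, test_examples in test_sets.items()
        }
    return results


def train_length_threshold(examples: List[Example]) -> Predictor:
    """Hypothetical toy detector (not from the paper): thresholds the word count
    at the midpoint between the two class means seen in training."""
    machine = [len(text.split()) for text, label in examples if label == 1]
    human = [len(text.split()) for text, label in examples if label == 0]
    if not machine or not human:
        raise ValueError("training data must contain both classes")
    mean_machine = sum(machine) / len(machine)
    mean_human = sum(human) / len(human)
    threshold = (mean_machine + mean_human) / 2
    machine_is_longer = mean_machine >= mean_human

    def predict(text: str) -> int:
        is_longer = len(text.split()) > threshold
        # Predict "machine" when the text falls on the machine side of the threshold.
        return int(is_longer == machine_is_longer)

    return predict
```

Calling `cross_domain_accuracy(train_length_threshold, train_sets, test_sets)` with one entry per domain yields a nested dictionary whose diagonal holds in-domain accuracy and whose off-diagonal entries hold cross-domain accuracy. A detector that relies on a domain-specific cue, as this toy one does, can score well on the diagonal and poorly elsewhere, which is the kind of gap the cross-domain experiments are designed to expose.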