Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI

Yuxia Wang, Rui Xing, Jonibek Mansurov, Giovanni Puccetti, Zhuohan Xie, Minh Ngoc Ta, Jiahui Geng, Jinyan Su, Mervat Abassy, Saad El Dine Ahmed, Kareem Elozeiri, Nurkhan Laiyk, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Alexander Aziz, Ryuto Koike, Masahiro Kaneko, Artem Shelmanov, Ekaterina Artemova, Vladislav Mikhailov, Akim Tsvigun, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov

TL;DR

The study challenges the view that humans cannot reliably distinguish human-written from AI-generated text, presenting a multilingual, cross-domain evaluation of $16$ datasets spanning $9$ languages and $9$ domains. With $19$ expert annotators and $11$ SOTA LLMs, the authors report an average detection accuracy of $87.6\%$, identify five robust linguistic signals that separate human and AI text, and show that prompts which explicitly describe these differences can bridge the gap in over half of cases. They also find nuanced human preferences: people do not always favor human-written text when the source is unclear. Such prompting also reduces detectability, which complicates the reliability of automated machine-generated text (MGT) detectors. The work contributes large-scale multilingual data, analyses of prompting effects, and insights for multilingual LLM alignment, while highlighting the need for broader participant diversity and automated linguistic analyses to advance practical detection and user-aligned generation systems.

Abstract

Prior studies have shown that distinguishing text generated by large language models (LLMs) from human-written text is highly challenging, with accuracy often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, thus challenging previous conclusions. We find that the major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Explicitly explaining these distinctions in the prompt can partially bridge the gaps in over 50% of the cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source.
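
To make the headline metric concrete, the following is a minimal sketch of how an average detection accuracy over several datasets could be computed. It assumes an unweighted (macro) mean over datasets, which the abstract does not specify. The function names, dataset names, and labels are hypothetical and invented for illustration; none of this is the authors' code or data.

```python
from statistics import mean


def detection_accuracy(gold, predicted):
    """Fraction of texts whose source label ("human" or "machine") was guessed correctly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no labels given")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def average_accuracy(per_dataset):
    """Unweighted (macro) mean of per-dataset accuracies (an assumption, see lead-in)."""
    return mean(detection_accuracy(g, p) for g, p in per_dataset.values())


# Toy labels, invented for illustration only (not data from the paper).
toy = {
    "dataset_a": (["human", "machine", "machine", "human"],
                  ["human", "machine", "human", "human"]),   # 3 of 4 correct
    "dataset_b": (["machine", "human"],
                  ["machine", "human"]),                     # 2 of 2 correct
}

print(average_accuracy(toy))  # 0.875
```

A macro mean gives every dataset equal weight regardless of size. Pooling all judgments before dividing (a micro mean) would instead weight larger datasets more heavily, so the two can differ when dataset sizes vary.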

Paper Structure

This paper contains 99 sections, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Evaluating whether the new generations fill in the gap: Yes, Partially, or No.
  • Figure 2: Human detection accuracy dropped from 87.6% on the original generations to 72.5% on the improved generations.
  • Figure 3: Human preferences for three Chinese datasets (five annotators): QA-emo is an emotion-rich question subset of Zhihu-QA with 100 examples.
  • Figure 4: Human preferences for two Russian (three annotators) and two Arabic datasets (two annotators).
  • Figure 5: Agreement among three annotators on Chinese essays regarding whether the improved prompts mitigate the gap between human text and machine-generated text.
  • ...and 1 more figure