AI Text Detectors and the Misclassification of Slightly Polished Arabic Text

Saleh Almohaimeed, Saad Almohaimeed, Mousa Jari, Khaled A. Alobaid, Fahad Alotaibi

TL;DR

The paper investigates how slight AI polishing of human-authored Arabic text can mislead AI detectors into falsely attributing the text to AI. It introduces two datasets (an 800-article Arabic corpus split evenly between AI-generated and human-authored text, and Ar-APT with 16400 polished samples) and assesses 10 LLMs plus four commercial detectors to measure detection accuracy and robustness to polishing. Findings show widespread misclassification: the best LLM detector, Claude-4 Sonnet, falls from 83.51% to 57.63% on articles slightly polished by LLaMA-3, and commercial detectors are especially fragile, with originality.AI falling from 92% to 12% on articles slightly polished by Mistral or Gemma-3. This motivates developing Arabic-focused detectors. The work highlights the practical impact on credibility and proposes dataset and methodological directions to improve detection fairness for Arabic text.

Abstract

Many AI detection models have been developed to counter the spread of articles created by artificial intelligence (AI). However, when a human-authored article is slightly polished by AI, the decision of these detectors can shift across the borderline, leading them to classify it as an AI-generated article. This misclassification may result in authors being falsely accused of AI plagiarism and harms the credibility of AI detectors. For English, some efforts have been made to address this challenge, but not for Arabic. In this paper, we build two datasets. The first contains 800 Arabic articles, half AI-generated and half human-authored. We use it to evaluate 14 large language models (LLMs) and commercial AI detectors on their ability to distinguish human-authored from AI-generated articles. The best 8 models were chosen to act as detectors for our primary concern: whether they would consider slightly polished human-authored text as AI-generated. The second dataset, Ar-APT, contains 400 Arabic human-authored articles polished by 10 LLMs using 4 polishing settings, totaling 16400 samples. We use it to evaluate the 8 selected models and determine whether slight polishing affects their performance. The results reveal that all AI detectors incorrectly attribute a significant number of articles to AI. The best-performing LLM, Claude-4 Sonnet, achieved 83.51%, but its performance dropped to 57.63% for articles slightly polished by LLaMA-3. The best-performing commercial model, originality.AI, achieved 92% accuracy, which dropped to 12% for articles slightly polished by Mistral or Gemma-3.
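As a rough illustration of the sample count and of the quantity being measured, the sketch below enumerates a polishing grid and computes how often a detector flags human-authored text as AI. It is not the authors' code. The polisher and setting labels are placeholders, the names `build_grid` and `false_attribution_rate` are hypothetical, and the inclusion of the 400 unpolished originals (which is what makes 400 × 10 × 4 = 16000 reach 16400) is an assumption, not something the abstract states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    article_id: int
    polisher: str     # LLM that polished the text, or "none" for the original
    setting: str      # polishing setting label (placeholder names only)
    flagged_ai: bool  # True if the detector labeled the text as AI-generated


def build_grid(n_articles, polishers, settings):
    """Enumerate (article, polisher, setting) triples.

    ASSUMPTION: the unpolished originals are counted as samples too.
    """
    grid = [(a, "none", "unpolished") for a in range(n_articles)]
    grid += [
        (a, p, s)
        for a in range(n_articles)
        for p in polishers
        for s in settings
    ]
    return grid


def false_attribution_rate(verdicts):
    """Fraction of human-authored samples that the detector labels as AI."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(v.flagged_ai for v in verdicts) / len(verdicts)


# Placeholder labels; the real model and setting names are not assumed here.
polishers = [f"llm_{i}" for i in range(10)]
settings = [f"setting_{j}" for j in range(4)]
grid = build_grid(400, polishers, settings)

# Toy verdicts for one article: three of four polished variants get flagged.
toy = [
    Verdict(0, "llm_0", "setting_0", True),
    Verdict(0, "llm_0", "setting_1", True),
    Verdict(0, "llm_0", "setting_2", True),
    Verdict(0, "llm_0", "setting_3", False),
]
toy_rate = false_attribution_rate(toy)
```

Because every sample in such a grid is human-authored, a detector's accuracy on it is simply one minus this false-attribution rate, which is how a drop such as 92% to 12% can be read.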

Paper Structure

This paper contains 14 sections, 9 figures, 11 tables.

Figures (9)

  • Figure 1: An example of an Arabic human-written article polished twice, under the 10% and 50% polishing settings. Words in red are those changed by polishing relative to the human-written text.
  • Figure 2: AI-text detection rate for GPT-4o with 5 different polishing settings.
  • Figure 3: AI-text detection rate for Deepseek 3.1 with 5 different polishing settings.
  • Figure 4: AI-text detection rate for Mistral with 5 different polishing settings.
  • Figure 5: AI-text detection rate for Kimi K2 with 5 different polishing settings.
  • ...and 4 more figures