IntegrityAI at GenAI Detection Task 2: Detecting Machine-Generated Academic Essays in English and Arabic Using ELECTRA and Stylometry

Mohammad AL-Smadi

TL;DR

This work addresses the challenge of detecting machine-generated academic essays in English and Arabic by fine-tuning ELECTRA-based detectors (ELECTRA for English and AraELECTRA for Arabic) combined with stylometric features. The author evaluates the models on a bilingual dataset of AI- and human-authored essays, following the three-phase setup of GenAI Content Detection Task 2, and compares them against a unigram TF-IDF/SVM baseline. The proposed IntegrityAI models achieve very high F1-scores in both languages (up to 100% in evaluation and up to 98.5% in testing). Stylometric features provide a notable boost, and ELECTRA-Large offers an additional gain for English at a higher compute cost. The results indicate strong generalization and suggest practical deployment potential, while highlighting the tradeoff between accuracy and resources and outlining directions for real-time detection, broader domains, and expanded language coverage.

Abstract

Recent research has investigated the problem of detecting machine-generated essays for academic purposes. To address this challenge, this research utilizes pre-trained, transformer-based models fine-tuned on Arabic and English academic essays with stylometric features. Custom models based on ELECTRA for English and AraELECTRA for Arabic were trained and evaluated using a benchmark dataset. The proposed models achieved excellent results, with an F1-score of 99.7%, ranking 2nd out of 26 teams in the English subtask, and 98.4%, finishing 1st out of 23 teams in the Arabic subtask.
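
To make the idea of "stylometric features" concrete, the following is a minimal sketch of how a handful of such features might be computed for a single essay. The summary above does not list which features the paper uses, so the feature set here (word count, average word length, average sentence length, type-token ratio, punctuation ratio) and the function name `stylometric_features` are illustrative assumptions, not the author's implementation.

```python
import re
import string


def stylometric_features(text: str) -> dict[str, float]:
    """Compute a few simple, illustrative stylometric features for one essay.

    NOTE: this feature set is an assumption chosen for illustration; the
    source does not specify the features used in the paper.
    """
    # Split on ., !, ? and the Arabic question mark; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?\u061F]+", text) if s.strip()]
    # \w+ is Unicode-aware in Python 3, so it also matches Arabic words.
    words = re.findall(r"\w+", text)
    n_words = len(words)
    n_chars = len(text)
    # ASCII punctuation plus the Arabic comma, semicolon, and question mark.
    punct = set(string.punctuation) | {"\u060C", "\u061B", "\u061F"}
    n_punct = sum(1 for ch in text if ch in punct)
    return {
        "word_count": float(n_words),
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "type_token_ratio": len({w.lower() for w in words}) / n_words if n_words else 0.0,
        "punctuation_ratio": n_punct / n_chars if n_chars else 0.0,
    }
```

A vector like this could, in principle, be supplied to a classifier alongside the ELECTRA representation of the essay; how the two are actually combined in the paper is shown in its Figure 1 and is not reproduced here.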

Paper Structure

This paper contains 9 sections, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: The architecture of the ELECTRA-based models with stylometric features.
  • Figure 2: Confusion matrices on the validation sets (Arabic dataset on the left).
  • Figure 3: Training vs. validation loss after each training epoch (AraELECTRA in the upper left, ELECTRA_small in the upper right, and ELECTRA_large in the lower left).