GenAI Content Detection Task 2: AI vs. Human -- Academic Essay Authenticity Challenge

Shammur Absar Chowdhury, Hind Almerekhi, Mucahid Kutlu, Kaan Efe Keles, Fatema Ahmad, Tasnim Mohiuddin, George Mikros, Firoj Alam

TL;DR

The paper presents the first edition of the GenAI Content Detection Task 2, which focuses on distinguishing AI-generated from human-authored academic essays in English and Arabic. It introduces the GRACE evaluation corpus, details the data collection and generation methods (including freehand generation and paraphrasing via LLMs), and describes a two-phase evaluation framework. Most participating teams relied on transformer-based models. Results are strong in both languages, with top systems achieving macro-F1 scores above 0.98, and they highlight the value of hybrid approaches that combine transformer models with stylometric and linguistic features. Limitations include dataset size, particularly for Arabic, which motivates future work on larger, more diverse corpora and expanded evaluation protocols for robust detection in academic contexts.

Abstract

This paper presents a comprehensive overview of the first edition of the Academic Essay Authenticity Challenge, organized as part of the GenAI Content Detection shared tasks co-located with COLING 2025. This challenge focuses on detecting machine-generated vs. human-authored essays for academic purposes. The task is defined as follows: "Given an essay, identify whether it is generated by a machine or authored by a human." The challenge involves two languages: English and Arabic. During the evaluation phase, 25 teams submitted systems for English and 21 teams for Arabic, reflecting substantial interest in the task. Finally, seven teams submitted system description papers. The majority of submissions utilized fine-tuned transformer-based models, with one team employing Large Language Models (LLMs) such as Llama 2 and Llama 3. This paper outlines the task formulation, details the dataset construction process, and explains the evaluation framework. Additionally, we present a summary of the approaches adopted by participating teams. Nearly all submitted systems outperformed the n-gram-based baseline, with the top-performing systems achieving F1 scores exceeding 0.98 for both languages, indicating significant progress in the detection of machine-generated text.
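
The reported scores are F1 values over a two-class decision (machine-generated vs. human-authored). As a minimal sketch of how a macro-averaged F1 could be computed for such a task, the snippet below averages the per-class F1 over both classes. The label strings `"human"` and `"ai"` and the function names are hypothetical choices for illustration; the official scoring script of the shared task is not shown here and may differ.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # By convention, return 0.0 when the class never appears in either list.
    return 2 * tp / denom if denom else 0.0


def macro_f1(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = ("human", "ai"),  # hypothetical label names
) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


if __name__ == "__main__":
    gold = ["human", "human", "ai", "ai"]
    pred = ["human", "ai", "ai", "ai"]
    print(round(macro_f1(gold, pred), 4))
```

Because the macro average weights both classes equally, a system cannot reach a high score by favoring whichever class is more frequent in the test set.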

Paper Structure

This paper contains 19 sections, 8 tables.