Team QUST at SemEval-2024 Task 8: A Comprehensive Study of Monolingual and Multilingual Approaches for Detecting AI-generated Text

Xiaoman Xu, Xiangrun Li, Taihang Wang, Jianxiang Tian, Ye Jiang

TL;DR

The paper addresses AI-generated text detection in SemEval-2024 Task 8 by developing monolingual and multilingual pipelines that combine data augmentation, MPU-based detection, fine-tuning, adapters, and stacking. It reports that data augmentation and ensemble methods improve accuracy, with XLM-R and DeBERTa-based models performing best in multilingual settings and a stacking ensemble providing further gains. The team's entry in the multilingual track of subtask A scored 8th by accuracy (officially ranked 13th), and the work highlights the practical value of parameter-efficient fine-tuning approaches such as LoRA and of cross-lingual modeling for detecting machine-generated text. These findings suggest that combining data-centric and model-centric strategies improves the identification of AI-generated content across languages, with implications for detecting misinformation and maintaining academic integrity.

Abstract

This paper presents the participation of team QUST in SemEval-2024 Task 8. We first performed data augmentation and cleaning on the dataset to enhance model training efficiency and accuracy. In the monolingual task, we evaluated traditional deep-learning methods, the multiscale positive-unlabeled (MPU) framework, fine-tuning, adapters, and ensemble methods. Then, we selected the top-performing monolingual models based on their accuracy and evaluated them in subtasks A and B. The final model was a stacking ensemble that combined fine-tuning with MPU. In the multilingual setting of subtask A, our system scored 8th in terms of accuracy on the official test set (officially ranked 13th). We release our system code at: https://github.com/warmth27/SemEval2024_QUST
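
The stacking step mentioned in the abstract can be illustrated with a minimal sketch. It assumes two base detectors, one fine-tuned classifier and one MPU-trained detector, that each output a probability that a text is machine-generated. A small logistic-regression meta-learner is then fitted on those probabilities. The choice of meta-learner, the function names, and all numbers below are illustrative assumptions, not details taken from the paper or its released code.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Meta-learner probability that the text is machine-generated."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit_meta_learner(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.5,
    epochs: int = 3000,
) -> Tuple[List[float], float]:
    """Fit logistic regression on base-model probabilities by batch gradient descent."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = score(weights, bias, x) - y  # derivative of log-loss w.r.t. the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return 1 (machine-generated) or 0 (human-written)."""
    return int(score(weights, bias, x) >= 0.5)


# Hypothetical held-out predictions: (p_fine_tuned, p_mpu) for six texts.
# Label 1 = machine-generated, 0 = human-written. Values are made up.
held_out_probs = [(0.9, 0.8), (0.7, 0.9), (0.6, 0.85), (0.2, 0.1), (0.4, 0.2), (0.3, 0.35)]
held_out_labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_meta_learner(held_out_probs, held_out_labels)
```

In a stacking setup the meta-learner is normally fitted on predictions for data the base models did not train on, so that it learns how much to trust each base model rather than memorizing their training-set confidence.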

Paper Structure (19 sections, 1 figure, 4 tables)


Figures (1)

  • Figure 1: The data distribution in subtasks A and B.