Fine-tuning Large Language Models for Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection
Feng Xiong, Thanet Markchom, Ziwei Zheng, Subin Jung, Varun Ojha, Huizhi Liang
TL;DR
The paper addresses the detection of machine-generated text across multiple generators, domains, and languages, focusing on Subtasks A and B of SemEval-2024 Task 8. It compares traditional ML pipelines built on NLP feature extraction with fine-tuned transformer models, including XLM-RoBERTa, LoRA-RoBERTa, and DistilmBERT, and also investigates a majority voting ensemble. Results show that transformer models, particularly LoRA-RoBERTa, outperform traditional methods, and that majority voting is particularly effective in multilingual settings. The findings suggest that efficient fine-tuning via LoRA, combined with ensemble techniques, can support detection across diverse linguistic and domain contexts.
Abstract
SemEval-2024 Task 8 introduces the challenge of identifying machine-generated texts from diverse Large Language Models (LLMs) in various languages and domains. The task comprises three subtasks: binary classification in monolingual and multilingual settings (Subtask A), multi-class classification (Subtask B), and mixed text detection (Subtask C). This paper focuses on Subtasks A and B. Each subtask is supported by three datasets for training, development, and testing. To tackle this task, two methods are applied: 1) traditional machine learning (ML) with natural language preprocessing (NLP) for feature extraction, and 2) fine-tuning LLMs for text classification. The results show that transformer models, particularly LoRA-RoBERTa, exceed traditional ML methods in effectiveness, with majority voting being particularly effective in multilingual contexts for identifying machine-generated texts.
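As a rough illustration of the majority voting idea mentioned above, the following minimal sketch combines the label predictions of several classifiers. It is not the authors' implementation: the function name `majority_vote`, the toy prediction lists, and the tie-breaking rule (prefer the label from the earliest-listed model among the tied labels) are all assumptions made for this example, since the abstract does not specify how the ensemble is composed or how ties are resolved.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label predictions by majority vote.

    predictions[m][i] is the label that model m assigns to text i.
    Tie-breaking is an assumption of this sketch: among labels with the
    same top count, the one voted by the earliest-listed model wins.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_texts = len(predictions[0])
    if any(len(p) != n_texts for p in predictions):
        raise ValueError("all models must predict the same number of texts")

    combined = []
    # zip(*predictions) yields, for each text, the tuple of votes in model order.
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # First vote (in model order) whose label reaches the top count.
        winner = next(label for label in votes if counts[label] == top)
        combined.append(winner)
    return combined


# Hypothetical predictions from three fine-tuned models on four texts
# (0 = human-written, 1 = machine-generated); the values are made up.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)  # [1, 0, 1, 1]
```

With an odd number of models and binary labels, as in Subtask A, ties cannot occur; the tie-breaking rule only matters for multi-class settings such as Subtask B or for an even number of voters.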
