PetKaz at SemEval-2024 Task 8: Can Linguistics Capture the Specifics of LLM-generated Text?
Kseniia Petukhova, Roman Kazakov, Ekaterina Kochmar
TL;DR
The paper addresses the detection of machine-generated text (MGT) in English for SemEval-2024 Task 8, proposing a black-box detector that combines RoBERTa-base CLS embeddings with diverse linguistic features and is trained on a resampled training set. The approach generalizes well, achieving 0.95 accuracy on the development data and 0.91 on the test set, and ranks 12th among 124 teams; it also shows that a model using linguistic features alone can match competitive baselines. Key findings are that lexical diversity features combine best with embeddings, and that careful training-data selection (notably WikiHow-focused subsets) improves performance across domains and models. The work has practical implications for robust MGT detection and suggests expanding the set of linguistic features and data-selection techniques in future research.
Abstract
In this paper, we present our submission to SemEval-2024 Task 8, "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection", focusing on the detection of machine-generated texts (MGTs) in English. Specifically, our approach combines embeddings from RoBERTa-base with diversity features and uses a resampled training set. We rank 12th out of 124 in Subtask A (monolingual track), and our results show that our approach generalizes across unseen models and domains, achieving an accuracy of 0.91.
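
To illustrate the idea of combining an embedding with diversity features, the following is a minimal sketch rather than the submitted system. It assumes the sentence embedding (for example, a RoBERTa-base CLS vector) has already been computed elsewhere and is passed in as a plain list of floats. The function names `lexical_diversity_features` and `fuse`, the regex tokenizer, and the choice of two features (type-token ratio and hapax ratio) are all hypothetical choices made for illustration; the feature set used in the paper may differ.

```python
import re
from typing import List, Sequence


def lexical_diversity_features(text: str) -> List[float]:
    """Return two simple diversity features: [type-token ratio, hapax ratio].

    Illustrative only: the tokenizer and the feature choice are assumptions,
    not the feature set described in the paper.
    """
    # Lowercase and keep alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return [0.0, 0.0]

    # Count how often each distinct token occurs.
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1

    # Type-token ratio: distinct tokens divided by total tokens.
    type_token_ratio = len(counts) / len(tokens)
    # Hapax ratio: tokens occurring exactly once divided by total tokens.
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / len(tokens)
    return [type_token_ratio, hapax_ratio]


def fuse(embedding: Sequence[float], text: str) -> List[float]:
    """Concatenate a precomputed embedding with the diversity features.

    `embedding` is assumed to come from an external encoder
    (e.g. a RoBERTa-base CLS vector); it is not computed here.
    """
    return list(embedding) + lexical_diversity_features(text)


if __name__ == "__main__":
    # Placeholder 2-dimensional "embedding", for demonstration only.
    print(fuse([0.1, 0.2], "the cat sat on the mat"))
```

The fused vector could then be passed to any downstream classifier. Concatenation is only one possible way to combine the two feature sources, chosen here for simplicity.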
