PetKaz at SemEval-2024 Task 8: Can Linguistics Capture the Specifics of LLM-generated Text?

Kseniia Petukhova, Roman Kazakov, Ekaterina Kochmar

TL;DR

The paper addresses the detection of machine-generated text (MGT) in English for SemEval-2024 Task 8. It proposes a black-box detector that combines RoBERTa-base [CLS] embeddings with linguistic features and is trained on a resampled training set. The approach generalizes to unseen models and domains, reaching 0.95 accuracy on the development data and 0.91 on the test set, which places it 12th among 124 teams. The authors also report that a model using only linguistic features can match competitive baselines. Key findings are that lexical diversity features combine best with embeddings, and that careful selection of training data (notably WikiHow-focused subsets) improves cross-domain and cross-model performance. The authors suggest expanding the linguistic feature set and the data-selection techniques in future research.

Abstract

In this paper, we present our submission to the SemEval-2024 Task 8 "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection", focusing on the detection of machine-generated texts (MGTs) in English. Specifically, our approach combines embeddings from RoBERTa-base with diversity features and uses a resampled training set. We rank 12th out of 124 for Subtask A (monolingual track), and our results show that our approach generalizes across unseen models and domains, achieving an accuracy of 0.91.

Paper Structure (24 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: For each text, we get a [CLS] token embedding from an autoencoder model and extract vectors of linguistic features (e.g., lexical diversity and stylometry). Then, we pass the concatenated vector to a feed-forward network, whose output layer performs binary classification: human-written text (HWT) vs. machine-generated text (MGT). The configurations of embeddings/features may vary between experiments.
  • Figure 2: Performance of our classifier across models.
  • Figure 3: Performance of our classifier across domains (on the development set).
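
The pipeline described in the Figure 1 caption (embedding plus linguistic features, concatenated and passed to a feed-forward classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: `placeholder_cls_embedding` is a hypothetical stand-in for a real RoBERTa-base [CLS] vector, the two features (type-token ratio and mean token length) are simple examples and not the paper's feature set, the layer sizes are arbitrary, and the weights are random and untrained.

```python
import math
import random

# Assumption: a small stand-in size; real RoBERTa-base [CLS] vectors have 768 dimensions.
EMBED_DIM = 8


def placeholder_cls_embedding(text, dim=EMBED_DIM):
    """Hypothetical stand-in for a [CLS] embedding: a deterministic pseudo-random vector."""
    rng = random.Random(text)  # seeding with the text keeps the vector reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def linguistic_features(text):
    """Two illustrative features (not the paper's set): type-token ratio, mean token length."""
    tokens = text.lower().split()
    if not tokens:
        return [0.0, 0.0]
    type_token_ratio = len(set(tokens)) / len(tokens)  # a basic lexical diversity measure
    mean_token_length = sum(len(t) for t in tokens) / len(tokens)
    return [type_token_ratio, mean_token_length]


class FeedForwardClassifier:
    """One hidden ReLU layer and a sigmoid output; weights are random, i.e. untrained."""

    def __init__(self, input_dim, hidden_dim=4, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(input_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def predict_proba(self, x):
        """Return the (untrained) probability assigned to the MGT class."""
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


def classify(text, model):
    """Concatenate embedding and features, then run the binary HWT-vs-MGT classifier."""
    vector = placeholder_cls_embedding(text) + linguistic_features(text)
    probability_mgt = model.predict_proba(vector)
    label = "MGT" if probability_mgt >= 0.5 else "HWT"
    return label, probability_mgt


if __name__ == "__main__":
    model = FeedForwardClassifier(input_dim=EMBED_DIM + 2)
    print(classify("the cat the dog", model))
```

Because the weights are untrained, the predicted label carries no meaning; the sketch only shows how the two input sources are joined into one vector before classification. In a real setup the placeholder would be replaced by an actual encoder output and the network would be trained on labeled HWT/MGT examples.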