SemEval-2024 Task 8: Weighted Layer Averaging RoBERTa for Black-Box Machine-Generated Text Detection
Ayan Datta, Aryan Chandramania, Radhika Mamidi
TL;DR
The paper tackles the detection of machine-generated text across diverse domains and generators by exploiting the hierarchical information in transformer representations. It introduces a weighted average over RoBERTa's layers, aggregates the resulting token representations into a single input for a classifier, and couples this with AdaLoRA for parameter-efficient fine-tuning. On SemEval-2024 Task 8 data, the method performs strongly on the evaluation data but underperforms on the official test data, which highlights generalization challenges and suggests that further hyperparameter tuning and alternative aggregation strategies are worth exploring. The work underscores the value of multi-layer linguistic signals for robust machine-generated text detection and offers a practical, parameter-efficient tuning framework for cross-domain detection tasks.
Abstract
This document contains the details of the authors' submission to the proceedings of SemEval 2024's Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection, Subtasks A (monolingual) and B. Detection of machine-generated text is becoming an increasingly important task with the advent of large language models (LLMs). In this paper, we lay out how using weighted averages of RoBERTa layers lets us capture information about text that is relevant to machine-generated text detection.
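To make the idea of a weighted average over layers concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the per-layer hidden states are already available as an array of shape (layers, tokens, dimensions), that the layer weights are obtained by a softmax over one learnable scalar per layer, and that tokens are aggregated by a masked mean. None of these choices are specified in the text above, and the names `weighted_layer_average`, `layer_logits`, and `attention_mask` are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def weighted_layer_average(hidden_states, layer_logits, attention_mask):
    """Combine per-layer token representations into one pooled vector.

    hidden_states:  array of shape (L, T, D), one (T, D) matrix per layer.
    layer_logits:   array of shape (L,), one scalar per layer
                    (assumed learnable; softmax-normalised here).
    attention_mask: array of shape (T,), 1 for real tokens, 0 for padding.
    """
    # Assumption: layer weights are a softmax over per-layer scalars.
    weights = softmax(np.asarray(layer_logits, dtype=float))  # (L,)

    # Weighted sum over the layer axis gives one vector per token.
    mixed = np.tensordot(weights, hidden_states, axes=(0, 0))  # (T, D)

    # Assumption: tokens are aggregated with a masked mean.
    mask = np.asarray(attention_mask, dtype=float)[:, None]  # (T, 1)
    pooled = (mixed * mask).sum(axis=0) / max(mask.sum(), 1.0)  # (D,)
    return pooled


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for 13 layer outputs (embeddings + 12 layers), 5 tokens, 8 dims.
    states = rng.normal(size=(13, 5, 8))
    pooled = weighted_layer_average(states, np.zeros(13), np.array([1, 1, 1, 0, 0]))
    print(pooled.shape)
```

In a trained model the pooled vector would be passed to a classification head, and the per-layer scalars would be learned jointly with the rest of the parameters, so that the classifier can emphasise whichever layers carry the most useful signal.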
