Qwen it detect machine-generated text?

Teodor-George Marchitan; Claudiu Creanga; Liviu P. Dinu

TL;DR

This work tackles binary multilingual machine-generated text detection for COLING 2025 Task 1 by comparing causal models (last-layer training) and masked models (LoRA-fine-tuned) across the monolingual and multilingual tracks. The Qwen2.5-0.5B-based causal approach, with data balancing and a constrained token length, achieves the top F1 Micro score in the monolingual track (0.8333) and the second-best F1 Macro (0.8301), while the LoRA-tuned masked model XLM-RoBERTa-Base provides a strong alternative. The multilingual results lag behind the monolingual ones, highlighting cross-language generalization challenges. Error analysis reveals strong performance on some unseen sources (e.g., ChatGPT-related data) but weaknesses on others (e.g., Mixset). Overall, the paper demonstrates effective architecture choices for Subtask A and outlines concrete directions for future work (language-specific fine-tuning, data augmentation, and latent feature exploitation) to mitigate overfitting and improve multilingual robustness.
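The summary mentions data balancing and a constrained token length without giving details. The following is a minimal sketch of one common way to do both, assuming balancing by downsampling the larger class and truncation to a fixed number of tokens. The function name `balance_and_truncate`, the default `max_tokens=512`, and the label convention (0 = human, 1 = machine) are hypothetical and are not taken from the paper.

```python
import random


def balance_and_truncate(examples, max_tokens=512, seed=0):
    """Downsample to equal class sizes and cap each example's token count.

    examples: list of (tokens, label) pairs, where tokens is a list.
    max_tokens: assumed cap; the paper's actual limit is not stated here.
    """
    rng = random.Random(seed)

    # Group truncated examples by label.
    by_label = {}
    for tokens, label in examples:
        by_label.setdefault(label, []).append((tokens[:max_tokens], label))

    # Keep as many examples per class as the smallest class has.
    n = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))

    rng.shuffle(balanced)
    return balanced
```

Downsampling is only one option; upsampling the minority class or weighting the loss would serve the same purpose and may be closer to what the authors did.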

Abstract

This paper describes the approach of the Unibuc - NLP team in tackling the Coling 2025 GenAI Workshop, Task 1: Binary Multilingual Machine-Generated Text Detection. We explored both masked language models and causal models. For Subtask A, our best model placed first out of 36 teams on F1 Micro (Auxiliary Score), with 0.8333, and second on F1 Macro (Main Score), with 0.8301.
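The two scores reported above aggregate per-class results differently. As a rough illustration (not the task's official scorer), the sketch below computes both for single-label predictions: micro F1 pools true positives, false positives, and false negatives over all classes, while macro F1 averages the per-class F1 values. The function name `f1_scores` is an assumption made for this example.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Six machine-written (1) and two human-written (0) texts, one human text misclassified.
micro, macro = f1_scores([1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 0])
```

In the single-label setting micro F1 equals accuracy, so the two scores diverge mainly when classes are imbalanced and one class is predicted worse than the other, as in the example.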

Paper Structure (10 sections, 6 figures, 3 tables)

This paper contains 10 sections, 6 figures, 3 tables.

Introduction
Background
Dataset
Previous Work
System overview
Causal models
Masked models
Results
Error Analysis
Conclusions and Future Work

Figures (6)

Figure 1: Subtask A: Distribution of token length for the training dataset.
Figure 2: Subtask A: Distribution of token length for the test dataset. We can see it is significantly different from the training set.
Figure 3: Subtask A: monolingual - accuracy by source for test set. We obtain best accuracy on NLPeer datasets, almost $100\%$.
Figure 4: Subtask A: monolingual - accuracy by model for test set. We obtained best accuracy on ChatGPT, but otherwise there is not a lot of variation between models.
Figure 5: Subtask A: monolingual - Sources of the text in test dataset. In the training dataset we had only 3 sources: mage ($46\%$), m4gt ($42\%$) and hc3 ($11\%$).
...and 1 more figure
