Discourse Features Enhance Detection of Document-Level Machine-Generated Content

Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller

TL;DR

This work tackles the challenge of detecting document-level machine-generated content, especially when humans revise or paraphrase long texts. It argues that surface-level features are insufficient and introduces discourse-aware modeling via DTransformer, which integrates PDTB-based discourse structure with semantic content. Two new datasets, paraLFQA and paraWP, are developed to probe discourse paraphrasing in long-form content, and DTransformer achieves substantial gains over state-of-the-art detectors across paraLFQA, paraWP, Plagbench, and M4, with notable ablation evidence that discourse features are the primary source of improvement. The approach advances robust detection of document-level MGC and suggests that explicit modeling of discourse structure is essential for resisting discourse-level paraphrase attacks in real-world settings.

Abstract

The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly in longer texts and in texts that have subsequently been paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and DIPPER, a discourse paraphrasing tool, by extending artifacts from their original versions. To better capture the structure of longer texts at document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. It yields substantial performance gains across datasets: a 15.5% absolute improvement on paraLFQA, a 4% absolute improvement on paraWP, and a 1.5% absolute improvement on M4 compared to SOTA approaches. The data and code are available at: https://github.com/myxp-lyp/Discourse-Features-Enhance-Detection-of-Document-Level-Machine-Generated-Content.git.

Paper Structure

This paper contains 10 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Comparison of human-written text (left) with its corresponding GPT-revised text (right).
  • Figure 2: PDTB annotation example (Lin et al., 2010): it shows the implicit relation, ACCORDINGLY, between the two given sentences.
  • Figure 3: DTransformer model: it incorporates both structural and semantic features. It first splits the paragraphs and employs a hierarchical model to capture document semantic features. Then, using a cross-attention mechanism, it analyses discourse features for improved classification.
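
The cross-attention step named in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that paragraph-level semantic vectors act as queries and discourse-relation vectors act as keys and values, omits the learned projection matrices a real Transformer layer would use, and runs on hypothetical toy inputs. The function names `softmax` and `cross_attention` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: one vector per paragraph (assumed semantic features).
    keys, values: one vector per discourse relation (assumed structural features).
    Returns one fused vector per query, plus the attention weights.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this paragraph to every discourse-relation vector.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        attn = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        mixed = [
            sum(a * v[j] for a, v in zip(attn, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        weights.append(attn)
    return outputs, weights


# Hypothetical toy inputs: 2 paragraph vectors, 3 discourse-relation vectors.
paragraphs = [[1.0, 0.0], [0.0, 1.0]]
relations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused, attn = cross_attention(paragraphs, relations, relations)
```

Each row of `attn` sums to 1, so every fused vector is a convex combination of the discourse-relation vectors; in a full model the fused vectors would then feed a classification head.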