Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT

Zhen Tao, Yanfang Chen, Dinghao Xi, Zhiyu Li, Wei Xu

TL;DR

This paper introduces CUDRT, a bilingual evaluation framework and dataset for detecting LLM-generated text across five operations: Create, Update, Delete, Rewrite, and Translate. It provides a large Chinese–English corpus of operation-specific text generated by diverse LLMs and evaluates both metric-based (MPU) and model-based (RoBERTa, XLNet) detectors under cross-dataset, cross-operation, and cross-LLM settings. Key findings show that operation-specific data, especially Delete, Update, and Rewrite, can improve detector generalization, and that detectors trained on texts generated by GPT-4-1106 or Qwen generalize better across LLMs. The framework supports scalable, multilingual evaluation and offers actionable guidance for building robust, cross-linguistic LLM-detection systems applicable to real-world moderation and information integrity tasks.

Abstract

The increasing prevalence of large language models (LLMs) has significantly advanced text generation, but the human-like quality of LLM outputs presents major challenges in reliably distinguishing between human-authored and LLM-generated texts. Existing detection benchmarks are constrained by their reliance on static datasets, scenario-specific tasks (e.g., question answering and text refinement), and a primary focus on English, overlooking the diverse linguistic and operational subtleties of LLMs. To address these gaps, we propose CUDRT, a comprehensive evaluation framework and bilingual benchmark in Chinese and English, categorizing LLM activities into five key operations: Create, Update, Delete, Rewrite, and Translate. CUDRT provides extensive datasets tailored to each operation, featuring outputs from state-of-the-art LLMs to assess the reliability of LLM-generated text detectors. This framework supports scalable, reproducible experiments and enables in-depth analysis of how operational diversity, multilingual training sets, and LLM architectures influence detection performance. Our extensive experiments demonstrate the framework's capacity to optimize detection systems, providing critical insights to enhance reliability, cross-linguistic adaptability, and detection accuracy. By advancing robust methodologies for identifying LLM-generated texts, this work contributes to the development of intelligent systems capable of meeting real-world multilingual detection challenges. Source code and dataset are available at GitHub.

Paper Structure

This paper contains 35 sections, 6 equations, 15 figures, and 10 tables.

Figures (15)

  • Figure 1: Illustration of the CUDRT evaluation framework.
  • Figure 2: The process of generating text through the "Complete" operation of LLMs.
  • Figure 3: The process of generating text through the "Polish" operation of LLMs.
  • Figure 4: The process of generating text through the "Expand" operation of LLMs.
  • Figure 5: The process of generating text through the "Summary" operation of LLMs.
  • ...and 10 more figures