
Vietnamese AI Generated Text Detection

Quang-Dan Tran, Van-Quan Nguyen, Quang-Huy Pham, K. B. Thang Nguyen, Trong-Hop Do

TL;DR

The authors address the challenge of detecting AI-generated text in Vietnamese by introducing ViDetect, a benchmark dataset of 6,800 essays split evenly between human-written and LLM-generated text. They fine-tune and evaluate five transformer models, three Vietnamese-specific (ViT5, BARTpho, PhoBERT) and two multilingual (mDeBERTa-v3, mBERT), on the binary task of identifying AI-generated content. Results on accuracy, F1, and AUROC indicate that transformer-based detectors are effective for Vietnamese, with AUROC improving as input length increases, and highlight the need for language-specific detectors. The work provides a dataset and baseline models for future research on Vietnamese NLP safety and authenticity detection.

Abstract

In recent years, Large Language Models (LLMs) have become integrated into our daily lives, serving as invaluable assistants in completing tasks. Because LLMs are so widely used, their abuse is inevitable, particularly for generating text content for various purposes, which makes it difficult to distinguish text generated by LLMs from text written by humans. In this study, we present ViDetect, a dataset of 6,800 Vietnamese essays, of which 3,400 were written by humans and the remainder generated by LLMs, intended for AI-generated text detection. We conducted evaluations using state-of-the-art methods, including ViT5, BARTpho, PhoBERT, mDeBERTa-v3, and mBERT. These results not only contribute to the growing body of research on detecting AI-generated text but also demonstrate the adaptability and effectiveness of different methods in the Vietnamese language context. This research lays the foundation for future advances in AI-generated text detection and provides valuable insights for researchers in natural language processing.
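The summary above reports accuracy, F1, and AUROC for a binary detector. As a rough illustration of what those three metrics measure, here is a minimal pure-Python sketch. It assumes label 1 means "LLM-generated" and label 0 means "human-written", and that the detector outputs a score in [0, 1]. The function names, the 0.5 threshold, and the toy scores are all hypothetical; none of the numbers come from the paper.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (assumed here: 1 = LLM-generated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the O(n^2) pairwise definition,
    which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from ViDetect): 3 LLM essays, 3 human essays.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("accuracy:", round(accuracy(y_true, y_pred), 4))
print("F1:", round(f1_score(y_true, y_pred), 4))
print("AUROC:", round(auroc(y_true, scores), 4))
```

Accuracy and F1 depend on the chosen threshold, while AUROC is computed from the raw scores and is threshold-independent, which is one reason detection papers often report all three.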

Paper Structure

This paper contains 27 sections, 3 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Illustration of the LLM-generated text detection task, a binary classification task that determines whether a given text was generated by an LLM or written by a human.
  • Figure 2: Overview of ViDetect dataset creation process.