Sentiment analysis and random forest to classify LLM versus human source applied to Scientific Texts

Javier J. Sanchez-Medina

TL;DR

Problem: Distinguishing human- and AI-generated scientific texts in academic contexts. Approach: a sentiment-analysis feature engineering pipeline using four lexicons (Bing, Afinn, NRC, Loughran-McDonald) to derive features, followed by a Random Forest classifier. Findings: on a dataset of 145 abstracts comprising 72 human and 73 AI-generated samples, the method achieves an overall accuracy of $0.84$ and a ROC-AUC of about $0.877$, with robust per-class performance. Significance: suggests a practical, lexicon-driven framework for detecting AI-authored scientific content and motivates integration with additional features and testing on newer LLMs such as GPT-4.

Abstract

Since the launch of ChatGPT v.4 there has been a lively worldwide discussion about the ability of this AI-powered platform, and of similar ones, to automatically produce all kinds of texts, including scientific and technical ones. This has prompted many institutions to reflect on whether educational and academic procedures should be adapted to the fact that, in the future, many of the texts we read will not be written by humans (students, scholars, etc.), or at least not entirely. This work proposes a new methodology for classifying texts as produced by an automatic text generation engine or by a human. It uses Sentiment Analysis as a source of feature-engineered independent variables, which are then used to train a Random Forest classification algorithm. Specifically, a number of new features were derived from four different sentiment lexicons and fed to the random forest to train the model. The results suggest that this is a promising line of research for detecting fraud in settings where humans are supposed to be the source of texts.
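The lexicon-based feature engineering described above can be illustrated with a minimal sketch. It is not the paper's implementation: the two tiny lexicons, their entries and scores, and the names `tokenize` and `sentiment_features` are all hypothetical stand-ins chosen for illustration. The sketch only shows the general idea of matching a text's words against a categorical lexicon (in the style of Bing) and a numeric one (in the style of Afinn) and turning the matches into numeric features.

```python
import re
from collections import Counter

# Hypothetical toy lexicons for illustration only. The real Bing and Afinn
# lexicons are far larger, and these entries and scores are made up.
TOY_BING = {"promising": "positive", "robust": "positive", "fraud": "negative"}
TOY_AFINN = {"promising": 1, "fraud": -4}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment_features(text):
    """Build one feature row by matching tokens against each lexicon."""
    tokens = tokenize(text)
    # Keep only tokens present in the lexicon (an inner join of words and lexicon).
    bing_labels = Counter(TOY_BING[t] for t in tokens if t in TOY_BING)
    afinn_scores = [TOY_AFINN[t] for t in tokens if t in TOY_AFINN]
    return {
        "n_tokens": len(tokens),
        "bing_positive": bing_labels["positive"],
        "bing_negative": bing_labels["negative"],
        "afinn_sum": sum(afinn_scores),
        "afinn_matches": len(afinn_scores),
    }


row = sentiment_features("Promising results, but fraud is a risk. Promising!")
print(row)
```

One such row per abstract, together with a human/AI label, would form the table on which a random forest classifier could be trained (for example with scikit-learn's `RandomForestClassifier`, which is an assumption here, not a tool named in the source).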

Paper Structure (10 sections, 6 figures, 3 tables)

Introduction
State of the Art
Data Ingestion and Preparation. Exploratory Analysis
Feature Engineering. Data Imputation, Data Augmentation and Sentiment Analysis
Preprocessing
Feature Engineering
Random Forest
Methodology
Results
Conclusions

Figures (6)

Figure 1: 35 most frequent stemmed words in the original paper abstracts, as published
Figure 2: 35 most frequent stemmed words in the ChatGPT v.3 generated paper abstracts
Figure 3: Stemmed words frequencies in the original papers versus the ChatGPT generated
Figure 4: Inner Join between texts words and lexicons
Figure 5: Data Prepared Structure
...and 1 more figures
