HULLMI: Human vs LLM identification with explainability

Prathamesh Dinesh Joshi, Sahil Pocker, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat

TL;DR

This paper evaluates human-vs-LLM text detection by comparing traditional ML pipelines with modern detectors such as RoBERTa-Sentinel and T5-Sentinel, while emphasizing explainability through LIME. Using OpenGPTText and OpenWebText-like data alongside a custom test set, it shows that traditional TF-IDF/CountVectorizer-based models can match deep detectors on many metrics, and that LIME reveals feature-level rationales behind predictions across all models. The work highlights the value of combining robust detection with explainability for production-ready tools applicable to education, media, cybersecurity, and governance. It concludes with a call for hybrid approaches and broader LLM coverage to ensure reliable, trustworthy AI-content detection in diverse real-world contexts.

Abstract

As LLMs become increasingly proficient at producing human-like responses, there has been a rise in academic and industrial efforts dedicated to flagging a given piece of text as "human" or "AI". Most of these efforts rely on modern NLP detectors like T5-Sentinel and RoBERTa-Sentinel, without paying much attention to the interpretability and explainability of these models. In our study, we provide a comprehensive analysis showing that traditional ML models (Naive-Bayes, MLP, Random Forests, XGBoost) perform as well as modern NLP detectors in human vs AI text detection. We achieve this by implementing a robust testing procedure on diverse datasets, including curated corpora and real-world samples. Subsequently, by employing the explainable AI technique LIME, we uncover the parts of the input that contribute most to the prediction of each model, providing insights into the detection process. Our study contributes to the growing need for production-level LLM detection tools, which can leverage the wide range of traditional and modern NLP detectors we propose. Finally, the LIME techniques we demonstrate also have the potential to equip these detection tools with interpretability analysis features, making them more reliable and trustworthy in domains such as education, healthcare, and media.
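
To make the idea of a traditional detector with word-level rationales concrete, the following is a minimal sketch, not the authors' implementation. It assumes a from-scratch bag-of-words Naive Bayes classifier and a toy, made-up corpus, and it replaces LIME with a much simpler leave-one-word-out perturbation. LIME itself fits a weighted local linear surrogate over many random perturbations, so the scores here are only a rough analogue. All names (`TRAIN`, `train_nb`, `prob_ai`, `word_contributions`) are hypothetical.

```python
import math
from collections import Counter

# Toy, made-up training data (NOT from the paper). Label 1 = "AI", 0 = "human".
TRAIN = [
    ("furthermore it is important to note that the results are significant", 1),
    ("in conclusion it is important to consider multiple perspectives", 1),
    ("lol i missed the bus again this morning", 0),
    ("my cat knocked the mug off the desk again", 0),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_nb(samples):
    """Collect per-class word counts for a multinomial Naive Bayes model."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in samples:
        counts[label].update(tokenize(text))
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab


def prob_ai(model, text):
    """Return P(label = AI | text) using add-one (Laplace) smoothing."""
    counts, docs, vocab = model
    log_post = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        lp = math.log(docs[label] / sum(docs.values()))
        for w in tokenize(text):
            if w in vocab:  # out-of-vocabulary words are ignored
                lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
        log_post[label] = lp
    # Turn the two log scores into a probability for the AI class.
    return 1.0 / (1.0 + math.exp(log_post[0] - log_post[1]))


def word_contributions(model, text):
    """Change in P(AI) when each word is removed, largest effect first.

    This is a simplified perturbation scheme standing in for LIME.
    """
    words = tokenize(text)
    base = prob_ai(model, text)
    scores = []
    for i, w in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((w, base - prob_ai(model, reduced)))
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)


model = train_nb(TRAIN)
sample = "it is important to note that my cat missed the bus"
for word, delta in word_contributions(model, sample):
    print(f"{word:>10s} {delta:+.3f}")
```

A positive score means that removing the word lowers the predicted probability of "AI", so the word pushed the prediction toward "AI"; a negative score means it pushed toward "human". Words the model has never seen score zero.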

Paper Structure (37 sections, 8 figures, 5 tables)

This paper contains 37 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The figure displays (a) Receiver Operating Characteristic (ROC) curves and (b) Detection Error Tradeoff (DET) curves for the traditional ML models analyzed. The ROC curves illustrate how well each model distinguishes between the positive and negative classes. Notably, the Random Forest, XGBoost, and MLP models achieve the highest AUC values of 0.99, indicating excellent classification performance. Logistic Regression and LSTM follow closely with AUCs of 0.98, while Naive Bayes has a lower AUC of 0.93. The DET curves show that models with higher ROC AUCs, such as Random Forest, XGBoost, and MLP, tend to have lower error rates, so a higher AUC corresponds to better overall performance. These models are therefore not only effective at distinguishing between classes but also balanced in minimizing both false positive and false negative rates.
  • Figure 2: The figure shows the Receiver Operating Characteristic (ROC) curves for (a) RoBERTa and (b) T5.
  • Figure 3: The stacked bar graph compares the performance of eight models on the OpenGPTText-Final Dataset in distinguishing AI-generated and human-written text. The T5-Sentinel model is the most effective, with a 0.996 probability of correctly identifying AI-generated text (TPR) and 0.96 for human-written text (TNR). RoBERTa-Sentinel follows closely with a TPR of 0.97 and a TNR of 0.95. In contrast, Naive Bayes has a 0.94 TNR but only a 0.44 TPR, favoring human-written text detection. Random Forest struggles with a lower TNR of 0.78, while Logistic Regression and LSTM perform consistently around 0.88-0.93 for both TPR and TNR. Overall, T5-Sentinel provides the most balanced and accurate classification for this dataset.
  • Figure 4: The stacked bar graph compares eight models' performance in classifying AI-generated and human-written text based on True Positive Rate (TPR), True Negative Rate (TNR), False Positive Rate (FPR), and False Negative Rate (FNR). RoBERTa excels with a TPR of 0.96 and a TNR of 0.88, identifying both text types well while keeping a low FPR of 0.12 and FNR of 0.04. MLP also performs well with a TPR of 0.96 and a TNR of 0.80, though it has a slightly higher FPR of 0.20. Logistic Regression achieves a TPR of 0.96 and a TNR of 0.72, with an FPR of 0.28. T5 shows balanced performance with a TPR of 0.92 and a TNR of 0.84, alongside an FPR of 0.16. Naive Bayes has the highest TNR of 1.00 but a TPR of only 0.56 (an FNR of 0.44), indicating difficulty in detecting AI-generated text. Random Forests and XGBoost show moderate results, while LSTM has the lowest TPR of 0.56 (tied with Naive Bayes) and a TNR of 0.64, suggesting less effective performance in distinguishing between text types.
  • Figure 5: One of the samples generated by ChatGPT-4o. The detailed sample can be accessed via the GitHub repository link provided in Section 6.
  • ...and 3 more figures
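
The rates and AUC values quoted in the captions above follow from standard definitions, which the short sketch below illustrates. The confusion-matrix counts are hypothetical, chosen only so that the arithmetic reproduces the RoBERTa rates quoted for Figure 4; the actual test-set sizes are not stated here. The scores passed to `roc_auc` are likewise made up, and both function names are illustrative.

```python
def rates(tp, fn, tn, fp):
    """Return TPR, TNR, FPR, FNR from confusion counts (positive = AI text)."""
    tpr = tp / (tp + fn)  # share of AI-generated texts flagged as AI
    tnr = tn / (tn + fp)  # share of human-written texts flagged as human
    return {"TPR": tpr, "TNR": tnr, "FPR": 1 - tnr, "FNR": 1 - tpr}


def roc_auc(scores, labels):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is equivalent to the area
    under the ROC curve but is only practical for small inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts: 25 AI texts (24 caught) and 25 human texts (22 cleared).
r = rates(tp=24, fn=1, tn=22, fp=3)
print({name: round(value, 2) for name, value in r.items()})

# Made-up detector scores for three AI (1) and three human (0) texts.
print(roc_auc([0.9, 0.8, 0.35, 0.6, 0.2, 0.1], [1, 1, 1, 0, 0, 0]))
```

Because FPR = 1 - TNR and FNR = 1 - TPR, each stacked bar in Figures 3 and 4 is fully determined by its TPR and TNR.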