ChatGPT Code Detection: Techniques for Uncovering the Source of Code

Marc Oedingen, Raphael C. Engelhardt, Robin Denz, Maximilian Hammer, Wolfgang Konen

TL;DR

This work tackles the problem of distinguishing human-written Python code from ChatGPT-generated code using a broad suite of features and models. It demonstrates that embedding-based representations significantly outperform traditional white-box features, with peak accuracy around $98\%$ and AUC near 0.999, while presenting a calibrated Bayes classifier for explainability. The study leverages a large, carefully balanced, problem-wise dataset and tests various formats, showing that code formatting modestly degrades performance but overall embedding methods remain robust. The findings have practical implications for education, software development, and assessment integrity, and they provide publicly available data and models to spur future research and policy development.

Abstract

In recent times, large language models (LLMs) have made significant strides in generating computer code, blurring the lines between code created by humans and code produced by artificial intelligence (AI). As these technologies evolve rapidly, it is crucial to explore how they influence code generation, especially given the risk of misuse in areas like higher education. This paper explores this issue by using advanced classification techniques to differentiate between code written by humans and code generated by ChatGPT, a type of LLM. We employ a new approach that combines powerful embedding features (black-box) with supervised learning algorithms - including Deep Neural Networks, Random Forests, and Extreme Gradient Boosting - to achieve this differentiation with an impressive accuracy of 98%. For the successful combinations, we also examine their model calibration, showing that some of the models are extremely well calibrated. Additionally, we present white-box features and an interpretable Bayes classifier to elucidate critical differences between the code sources, enhancing the explainability and transparency of our approach. Both of these approaches work well but reach at most 85-88% accuracy. We also show that untrained humans perform no better than random guessing on the same task. This study is crucial in understanding and mitigating the potential risks associated with using AI in code generation, particularly in the context of higher education, software development, and competitive programming.
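
The "embed, then classify" idea described in the abstract can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses a hand-rolled TF-IDF embedding and swaps in a nearest-centroid rule in place of the Deep Neural Networks, Random Forests, and Extreme Gradient Boosting models named above. The four training snippets, the tokenizer, and all function names are hypothetical and were invented for this example.

```python
import math
import re
from collections import Counter

# Hypothetical toy samples invented for illustration (not from the paper's dataset).
TRAIN = [
    ("human", "n=int(input())\nprint(n*2)"),
    ("human", "a,b=map(int,input().split())\nprint(a+b)"),
    ("gpt", "def double_value(number):\n    # Return the doubled value\n    return number * 2"),
    ("gpt", "def add_numbers(first, second):\n    # Return the sum of both values\n    return first + second"),
]

def tokenize(code):
    # Crude lexer (an assumption): identifiers, integers, single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

def fit_idf(docs):
    # Smoothed inverse document frequency over the training snippets.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}

def embed(code, idf):
    # Sparse, unit-length TF-IDF vector; tokens unseen during fitting are dropped.
    tf = Counter(tokenize(code))
    vec = {tok: cnt * idf[tok] for tok, cnt in tf.items() if tok in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}

def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    # Component-wise mean of sparse vectors.
    mean = {}
    for vec in vectors:
        for tok, w in vec.items():
            mean[tok] = mean.get(tok, 0.0) + w / len(vectors)
    return mean

def predict(code, idf, centroids):
    # Nearest-centroid rule: pick the class whose mean embedding is most similar.
    query = embed(code, idf)
    return max(centroids, key=lambda label: cosine(query, centroids[label]))

IDF = fit_idf([code for _, code in TRAIN])
CENTROIDS = {
    label: centroid([embed(code, IDF) for lab, code in TRAIN if lab == label])
    for label in ("human", "gpt")
}

GPT_STYLE = "def square_value(number):\n    # Return the squared value\n    return number * number"
HUMAN_STYLE = "x=int(input())\nprint(x+1)"
print(predict(GPT_STYLE, IDF, CENTROIDS), predict(HUMAN_STYLE, IDF, CENTROIDS))
```

On this toy data, the commented, function-style query falls closer to the ChatGPT centroid and the terse input-and-print query closer to the human centroid. Four hand-written snippets say nothing about real-world accuracy; the sketch only shows the shape of the approach, in which the classifier sees a numeric representation of the code rather than the code itself.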

Paper Structure (34 sections, 5 equations, 12 figures, 6 tables)

Figures (12)

  • Figure S1: Flowchart of our Code Detection Methodology
  • Figure S2: Example of a row in the dataset
  • Figure S3: Distribution of the code length (number of tokens according to cl100k_base encoding) across the unformatted and formatted dataset. Values larger than the $99\%$ quantile were removed to avoid a distorted picture.
  • Figure S4: Cosine similarity between all human and GPT code samples embedded using Ada and TF-IDF, both formatted and unformatted.
  • Figure S5: Box plot of human-designed features for formatted and unformatted dataset.
  • ...and 7 more figures