AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection
Basak Demirok, Mucahid Kutlu
TL;DR
This paper tackles the problem of distinguishing AI-generated code from human-written code by introducing AIGCodeSet, a reproducible dataset built from CodeNet Python problems. After quality filtering, it combines 4,755 human-written code snippets with 2,828 AI-generated snippets produced by three LLMs (CodeLlama, Codestral, Gemini) across three usage scenarios. Baseline detectors using Ada embeddings, TF-IDF vectors, and a Bayes classifier are evaluated; the Bayes classifier generally yields the best recall and F1, though performance depends strongly on the specific LLM and generation scenario. The work provides a benchmark and shows that detectability varies with how the AI code is produced (from scratch vs. fixing outputs), which has practical implications for evaluating and deploying AI-assisted coding tools. The dataset and baselines enable reproducible research and motivate future expansion to more languages, models, and usage contexts, including user studies of real-world LLM use in software development.
Abstract
While large language models provide significant convenience for software development, they can raise ethical issues in job interviews and student assignments. Determining whether a piece of code was written by a human or generated by an artificial intelligence (AI) model is therefore a critical problem. In this study, we present AIGCodeSet, which consists of 2,828 AI-generated and 4,755 human-written Python code snippets, with the AI-generated snippets created using CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash. In addition, we share the results of our experiments with baseline detection methods. Our experiments show that a Bayesian classifier outperforms the other models.
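
To make the idea of a Bayes-style detector concrete, the following is a minimal sketch of a multinomial naive Bayes classifier over code tokens, using only the Python standard library. It is an illustration under stated assumptions, not the classifier evaluated in the paper: the tokenizer, the class name TokenNaiveBayes, the Laplace smoothing, and the four toy training snippets are all hypothetical and do not come from AIGCodeSet.

```python
import math
import re
from collections import Counter


def tokenize(code):
    # Split source text into identifiers, integer literals, and single
    # punctuation characters. This is an assumed, deliberately crude tokenizer.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


class TokenNaiveBayes:
    """Multinomial naive Bayes over code tokens with Laplace (add-one) smoothing."""

    def fit(self, snippets, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.token_counts = {c: Counter() for c in self.classes}
        for snippet, label in zip(snippets, labels):
            self.token_counts[label].update(tokenize(snippet))
        # Shared vocabulary across classes, needed for the smoothing denominator.
        self.vocab = set()
        for counts in self.token_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, snippet):
        tokens = tokenize(snippet)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            class_total = sum(self.token_counts[c].values())
            denom = class_total + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(self.doc_counts[c] / total_docs)
            for tok in tokens:
                score += math.log((self.token_counts[c][tok] + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, snippet):
        scores = self.log_scores(snippet)
        return max(scores, key=scores.get)


# Hypothetical toy data, invented for illustration only (not from AIGCodeSet).
human_snippets = [
    "n=int(input())\nprint(n*2)",
    "a,b=map(int,input().split())\nprint(a+b)",
]
ai_snippets = [
    'def main():\n    # Read the input value\n    n = int(input())\n'
    '    print(n * 2)\n\nif __name__ == "__main__":\n    main()',
    'def solve(a, b):\n    """Return the sum."""\n    return a + b',
]

detector = TokenNaiveBayes().fit(
    human_snippets + ai_snippets,
    ["human"] * len(human_snippets) + ["ai"] * len(ai_snippets),
)
```

Calling detector.predict(snippet) returns the label with the higher log score. With only four training snippets the decision reflects surface habits such as a main() wrapper or docstring quotes, so the sketch shows the mechanics of the technique rather than any realistic detection performance.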
