AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection

Basak Demirok, Mucahid Kutlu

TL;DR

This paper tackles the problem of distinguishing AI-generated code from human-written code by introducing AIGCodeSet, a large, reproducible dataset built from CodeNet Python problems. It combines 4,755 human-written codes with 2,828 AI-generated codes produced by three LLMs (CodeLlama, Codestral, Gemini) across three usage scenarios, after filtering for quality. Baseline detectors using Ada embeddings, TF-IDF vectors, and a Bayes classifier are evaluated, revealing that the Bayes approach generally yields the best recall and F1, though performance depends strongly on the specific LLM and generation scenario. The work provides a valuable benchmark and emphasizes that detectability varies with how AI code is produced (from-scratch vs. fixing outputs), highlighting practical implications for evaluating and deploying AI-assisted coding tools. The dataset and baselines enable reproducible research and motivate future expansion to more languages, models, and usage contexts, including user studies of real-world LLM use in software development.
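The recall and F1 figures mentioned above are standard binary-classification metrics. As a minimal sketch of how they are computed when "AI-generated" is treated as the positive class (the function name, the label encoding 1 = AI-generated / 0 = human-written, and the toy labels are assumptions made for illustration, not values or code from the paper):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels only (hypothetical): 1 = AI-generated, 0 = human-written.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall is the share of AI-generated samples the detector catches, so a detector with high recall but low precision would flag many human-written submissions as well; F1 balances the two.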

Abstract

While large language models provide significant convenience for software development, they can lead to ethical issues in job interviews and student assignments. Therefore, determining whether a piece of code is written by a human or generated by an artificial intelligence (AI) model is a critical issue. In this study, we present AIGCodeSet, which consists of 2,828 AI-generated and 4,755 human-written Python codes, created using CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash. In addition, we share the results of our experiments conducted with baseline detection methods. Our experiments show that a Bayesian classifier outperforms the other models.
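The abstract does not describe the form of the Bayesian classifier. To make the general idea concrete, here is a minimal sketch of one common variant, a multinomial naive Bayes model over code tokens with Laplace smoothing. It is a generic stand-in and should not be read as the authors' implementation; the tokenizer, the class name `NaiveBayesCodeDetector`, the labels `"human"` and `"ai"`, and the four toy training snippets are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: identifiers, integer literals, or any single other symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    return TOKEN_RE.findall(code)


class NaiveBayesCodeDetector:
    """Multinomial naive Bayes over code tokens (illustrative sketch only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha           # Laplace smoothing constant
        self.token_counts = {}       # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training samples
        self.vocab = set()

    def fit(self, codes, labels):
        for code, label in zip(codes, labels):
            tokens = tokenize(code)
            self.doc_counts[label] += 1
            self.token_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, code):
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(code) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.token_counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, code):
        scores = self.log_scores(code)
        return max(scores, key=scores.get)


# Hypothetical toy data; the stylistic contrast is invented for illustration.
train_codes = [
    "n=int(input())\nprint(n*2)",
    "a,b=map(int,input().split())\nprint(a+b)",
    "def main():\n    # Read the input value\n    number = int(input())\n"
    "    print(number * 2)\n\nmain()",
    "def main():\n    # Read two integers\n    first, second = map(int, input().split())\n"
    "    print(first + second)\n\nmain()",
]
train_labels = ["human", "human", "ai", "ai"]

detector = NaiveBayesCodeDetector().fit(train_codes, train_labels)
unseen = "def main():\n    # Read the value\n    value = int(input())\n    print(value)\n\nmain()"
print(detector.predict(unseen))
```

With only four training snippets the model simply memorizes surface tokens such as `def`, `main`, and comment markers; a real detector would need the full dataset and a held-out test split to say anything about generalization.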

Paper Structure

This paper contains 11 sections, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: Human-written code vs. code generated from scratch in AIGCodeSet.