Is This You, LLM? Recognizing AI-written Programs with Multilingual Code Stylometry
Andrea Gurioli, Maurizio Gabbrielli, Stefano Zacchiroli
TL;DR
This paper tackles the problem of automatically recognizing AI-written source code across multiple programming languages. It introduces a transformer-based encoder classifier trained on H-AIRosettaMP, a novel open multilingual dataset built by translating human-written Rosetta Code solutions into AI-written counterparts with StarCoder2, so that a single model can detect AI-written code in 10 languages. The classifier reaches an average accuracy of 84.1% $\pm$ 3.8% across those languages, and the dataset, training code, and resulting models are released openly so that the pipeline is fully reproducible. The study also compares against existing baselines and tests on ChatGPT-generated data, highlighting both the potential and the limitations of current detectors, and argues for reproducible, open benchmarks as a baseline for future AI code stylometry work. Overall, it provides a multilingual, fully open framework for detecting AI-generated code, with practical implications for policy and software provenance.
Abstract
With the increasing popularity of LLM-based code completers, like GitHub Copilot, the interest in automatically detecting AI-generated code is also increasing, in particular in contexts where the use of LLMs to program is forbidden by policy due to security, intellectual property, or ethical concerns. We introduce a novel technique for AI code stylometry, i.e., the ability to distinguish code generated by LLMs from code written by humans, based on a transformer-based encoder classifier. Unlike previous work, our classifier is capable of detecting AI-written code across 10 different programming languages with a single machine learning model, maintaining high average accuracy across all languages (84.1% $\pm$ 3.8%). Together with the classifier we also release H-AIRosettaMP, a novel open dataset for AI code stylometry tasks, consisting of 121 247 code snippets in 10 popular programming languages, labeled as either human-written or AI-generated. The experimental pipeline (dataset, training code, resulting models) is the first fully reproducible one for the AI code stylometry task. Most notably, our experiments rely only on open LLMs, rather than on proprietary/closed ones like ChatGPT.
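To make the headline figure concrete, the following minimal sketch shows one way a per-language accuracy could be aggregated into a single "mean $\pm$ spread" number for a multilingual detector. It is an illustration only, not the authors' evaluation code: the function names, the record layout, and the toy predictions are hypothetical, and it assumes that the $\pm$ term denotes the (population) standard deviation of accuracy across languages, which the abstract does not state.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_language_accuracy(records):
    """Compute accuracy per language.

    records: iterable of (language, true_label, predicted_label) tuples.
    (Hypothetical layout, chosen for this illustration.)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, truth, pred in records:
        total[lang] += 1
        correct[lang] += int(truth == pred)
    return {lang: correct[lang] / total[lang] for lang in total}


def summarize(acc_by_lang):
    """Return (mean accuracy, population std dev) across languages.

    Assumption: the spread is a standard deviation over languages.
    """
    values = list(acc_by_lang.values())
    return mean(values), pstdev(values)


# Toy predictions, invented for illustration (not data from the paper).
records = [
    ("python", "human", "human"),
    ("python", "ai", "ai"),
    ("python", "ai", "human"),   # one misclassified snippet
    ("python", "human", "human"),
    ("java", "human", "human"),
    ("java", "ai", "ai"),
]

acc = per_language_accuracy(records)
avg, spread = summarize(acc)
print(f"{avg:.1%} +/- {spread:.1%}")
```

Averaging per-language accuracies, rather than pooling all snippets, gives each language equal weight regardless of how many snippets it contributes; whether the paper weights languages this way is likewise an assumption of this sketch.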
