Is This You, LLM? Recognizing AI-written Programs with Multilingual Code Stylometry

Andrea Gurioli, Maurizio Gabbrielli, Stefano Zacchiroli

TL;DR

This paper addresses the automatic recognition of AI-written source code across multiple programming languages. It introduces a transformer-based encoder classifier trained on a new open, multilingual dataset (H-AIRosettaMP), built by using StarCoder2 to translate human-written Rosetta Code solutions into AI-written counterparts, so that a single model can detect AI-written code in every covered language. The classifier reaches an average accuracy of 84.1% $\pm$ 3.8% across 10 languages, and the dataset, training code, and resulting models are all released openly, making the pipeline fully reproducible. The study also compares against existing baselines and tests on ChatGPT-generated data, showing both the potential and the limitations of current detectors, and argues that reproducible, open benchmarks should serve as the baseline for future AI code stylometry work.

Abstract

With the increasing popularity of LLM-based code completers, like GitHub Copilot, the interest in automatically detecting AI-generated code is also increasing, in particular in contexts where the use of LLMs to program is forbidden by policy due to security, intellectual property, or ethical concerns. We introduce a novel technique for AI code stylometry, i.e., the ability to distinguish code generated by LLMs from code written by humans, based on a transformer-based encoder classifier. Unlike previous work, our classifier is capable of detecting AI-written code across 10 different programming languages with a single machine learning model, maintaining high average accuracy across all languages (84.1% $\pm$ 3.8%). Together with the classifier we also release H-AIRosettaMP, a novel open dataset for AI code stylometry tasks, consisting of 121 247 code snippets in 10 popular programming languages, labeled as either human-written or AI-generated. The experimental pipeline (dataset, training code, resulting models) is the first fully reproducible one for the AI code stylometry task. Most notably, our experiments rely only on open LLMs, rather than on proprietary/closed ones like ChatGPT.

Paper Structure

This paper contains 25 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Experimental methodology. The process is divided into two main steps: (1) dataset construction, which starts from the filtered Rosetta Code dataset and ends with H-AIRosettaMP, obtained via code translation and comprising 90 (sub-)datasets. Each dataset is labeled by author (Human or AI) and is identified by dst (the language of the dataset) and src (the source language from which the AI-generated part was translated); (2) model training, which leads to 90 monolingual models (one per dataset) and 1 multilingual model.
  • Figure 2: Code translation step. The human-labeled part of the dataset (the ArrayLength Java class here) is a solution to a task from Rosetta Code (Array concatenation). The AI-labeled part is obtained via code translation (from Python to Java) using StarCoder2. The input to the translation is a human-written solution for the same task, in a different programming language.
  • Figure 3: Synopsis of the prompt given to StarCoder2 for translating a given code snippet (CODE_SNIPPET in the text) from a source programming language (SOURCE_LANGUAGE) to a target one (TARGET_LANGUAGE). A prefix of the desired answer (Here is the…) is provided because StarCoder2 has not been fine-tuned for chat-based interaction and is strictly a completion model.
  • Figure 4: Transformer-based architecture of the Human/AI stylometry classifier. Input source code is tokenized and passed to the CodeT5plus encoder, which outputs one vector representation per token. The representation of the first token (<s> in the figure) is used as input for the classification head, which produces the final class probability.
  • Figure 5: Distribution of unique tasks for which solutions are present in the dataset, per language. For each task, both a human-written and an AI-written snippet are always provided. Overlapping tasks denote the number of tasks for which multiple AI-written solutions are present, with a guarantee that they have been translated from all other programming languages in the dataset.
  • ...and 1 more figure
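
The dataset layout in Figure 1 and the prompt in Figure 3 can be illustrated with a short Python sketch. It assumes that the 90 (sub-)datasets correspond to every ordered pair of distinct languages among the ten (10 × 9 = 90), which the caption suggests but does not state outright. The language names, the function names `dataset_pairs` and `build_prompt`, and the instruction wording inside the prompt are all hypothetical; only the three placeholders and the idea of ending the prompt with a prefix of the desired answer come from the captions. The answer prefix is left as a parameter because the captions show it only in truncated form.

```python
from itertools import permutations

# Placeholder language names: the captions do not list the ten languages,
# so these identifiers are purely illustrative.
LANGUAGES = [f"lang{i}" for i in range(10)]


def dataset_pairs(languages):
    """Return every ordered (src, dst) pair of distinct languages."""
    return list(permutations(languages, 2))


def build_prompt(code_snippet, source_language, target_language, answer_prefix):
    """Assemble a completion-style translation prompt.

    The instruction wording is hypothetical. The prompt ends with a prefix of
    the desired answer, since a completion model continues the text it is
    given rather than replying to a chat message.
    """
    return (
        f"Translate the following {source_language} code "
        f"to {target_language}.\n\n"
        f"{code_snippet}\n\n"
        f"{answer_prefix}"
    )


pairs = dataset_pairs(LANGUAGES)
print(len(pairs))  # 90 ordered pairs, one per (sub-)dataset under this assumption
```

Under this reading, each monolingual model would be trained on one (src, dst) pair, while the multilingual model would see all of them.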