Detection of LLM-Generated Java Code Using Discretized Nested Bigrams

Timothy Paek, Chilukuri Mohan

TL;DR

This work tackles the problem of detecting LLM-generated Java code by framing it as a binary classification of code fragments. It introduces discretized nested bigram features derived from Abstract Syntax Tree structure and transformer signals (EWD_NB_F, EWD_CBNB_CM, CNB_F) and applies ensemble classifiers to achieve high accuracy across large-scale datasets, including GPT Dataset, GPT GCJ, and a 40-author benchmark. The results show near-perfect performance (up to 0.99 accuracy and AUC approaching 1.0) that significantly outperforms a widely used GPT-detection API and prior code-authorship methods, while maintaining relatively low feature dimensionality (as few as 12 features in some configurations). The work provides publicly available datasets and demonstrates scalable, robust detection of LLM-generated code, with implications for academic integrity and cybersecurity, and outlines directions for broader robustness and new discretization strategies.

Abstract

Large Language Models (LLMs) are currently used extensively to generate code by professionals and students, motivating the development of tools to detect LLM-generated code for applications such as academic integrity and cybersecurity. We address this authorship attribution problem as a binary classification task along with feature identification and extraction. We propose new Discretized Nested Bigram Frequency features on source code groups of various sizes. Compared to prior work, improvements are obtained by representing sparse information in dense membership bins. Experimental evaluation demonstrated that our approach significantly outperformed a commonly used GPT code-detection API and baseline features, with accuracy exceeding 96%, compared to 72% and 79% respectively, in detecting GPT-rewritten Java code fragments for 976 files with GPT-3.5 and GPT-4 using 12 features. We also outperformed three prior works on code author identification in a 40-author dataset. Our approach scales well to larger datasets, and we achieved 99% accuracy and 0.999 AUC for 76,089 files and over 1,000 authors with GPT-4o using 227 features.
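
The idea of "representing sparse information in dense membership bins" can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes plain adjacent-token bigrams rather than nested bigrams taken from an Abstract Syntax Tree, and it assumes equal-width bins over the relative-frequency range [0, 1]. The function names `bigram_frequencies` and `bin_membership`, the toy token list, and the bin count are all hypothetical choices made for illustration.

```python
from collections import Counter


def bigram_frequencies(tokens):
    """Return the relative frequency of each adjacent token pair.

    Assumption: simple token bigrams stand in for the paper's nested bigrams.
    """
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return {}
    counts = Counter(pairs)
    return {pair: n / len(pairs) for pair, n in counts.items()}


def bin_membership(frequencies, n_bins=10):
    """Count how many distinct bigrams fall into each equal-width bin of [0, 1].

    The result always has n_bins entries, however many distinct bigrams
    the input contains, which is what makes the representation dense.
    """
    bins = [0] * n_bins
    for freq in frequencies.values():
        # A frequency of exactly 1.0 is clamped into the last bin.
        index = min(int(freq * n_bins), n_bins - 1)
        bins[index] += 1
    return bins


# Toy input: a pre-tokenized Java-like fragment (hypothetical example data).
tokens = ["int", "x", "=", "0", ";", "int", "y", "=", "0", ";"]
features = bin_membership(bigram_frequencies(tokens), n_bins=10)
print(features)
```

A sparse bigram vocabulary grows with every new token pair seen in the corpus, whereas the binned vector keeps a fixed length set by `n_bins`. That is one plausible reading of how a classifier could work with a small number of features, such as the 12 reported above.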

Paper Structure

This paper contains 15 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Feature Extraction Flowchart
  • Figure 2: Two major components of the Feature Extraction process