Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers

Yuling Shi, Hongyu Zhang, Chengcheng Wan, Xiaodong Gu

TL;DR

The paper tackles the problem of distinguishing machine-generated code from human-written code to preserve software integrity. It conducts a thorough empirical analysis of code patterns across lexical diversity, conciseness, and naturalness, revealing that machine-generated Python tends to be more concise and natural, with distinctive whitespace usage. Building on these insights, the authors introduce DetectCodeGPT, a zero-shot detector that perturbs code by inserting spaces and newlines and uses Normalized Perturbed Log Rank (NPR) to assess naturalness, avoiding reliance on external LLMs. Across CodeSearchNet and The Stack, and for six code-language models, DetectCodeGPT achieves a statistically significant AUROC advantage (average about 7.6%) over baselines and demonstrates robustness in cross-model scenarios. These findings offer a practical approach to safeguarding code provenance and authenticity in real-world software development pipelines.
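To make the scoring idea concrete, the following is a minimal sketch of an NPR-style score. It assumes one common formulation, in which the average log-rank of the perturbed copies is divided by the log-rank of the original. The function names (`mean_log_rank`, `npr_score`) and the toy rank values are hypothetical; in practice the per-token ranks would come from a scoring language model, which is not shown here.

```python
import math
from typing import Sequence


def mean_log_rank(token_ranks: Sequence[int]) -> float:
    """Average log of the 1-based rank a model assigns to each observed token."""
    if not token_ranks:
        raise ValueError("token_ranks must be non-empty")
    return sum(math.log(r) for r in token_ranks) / len(token_ranks)


def npr_score(
    original_ranks: Sequence[int],
    perturbed_ranks_list: Sequence[Sequence[int]],
    eps: float = 1e-8,
) -> float:
    """Ratio of the mean perturbed log-rank to the original log-rank.

    Assumed formulation for illustration only; eps guards against a zero
    denominator when every original token is ranked first.
    """
    original = mean_log_rank(original_ranks)
    perturbed = sum(mean_log_rank(r) for r in perturbed_ranks_list) / len(
        perturbed_ranks_list
    )
    return perturbed / (original + eps)


# Toy ranks (hypothetical): the original snippet is ranked highly by the model,
# while its perturbed copies are ranked noticeably worse.
score = npr_score([1, 1, 2, 1], [[2, 3, 2, 4], [1, 2, 4, 2]])
```

Under this assumption, a score well above 1 means the perturbations made the snippet look much less natural to the model, which is the kind of signal such a detector would threshold on.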

Abstract

Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, raising integrity and authenticity issues for software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated texts, but they do not identify and harness the unique patterns of machine-generated code. As a result, their effectiveness falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine- and human-authored code. Through a rigorous analysis of code attributes such as lexical diversity, conciseness, and naturalness, we expose unique patterns inherent to each source. We particularly notice that the syntactic segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose DetectCodeGPT, a novel method for detecting machine-generated code, which improves DetectGPT by capturing the distinct stylized patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experimental results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
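The whitespace perturbation described above can be illustrated with a short sketch. This is not the authors' implementation: the function name `perturb_whitespace`, the two probability parameters, and the choice to leave leading indentation untouched are all assumptions made for illustration. The sketch doubles some existing spaces inside a line and adds blank lines after some non-empty lines, so no LLM is needed to produce the perturbed copies.

```python
import random
from typing import Optional


def perturb_whitespace(
    code: str,
    space_frac: float = 0.3,
    newline_frac: float = 0.3,
    seed: Optional[int] = None,
) -> str:
    """Return a copy of `code` with extra spaces and blank lines inserted.

    Hypothetical helper: only whitespace is added, and leading indentation
    is preserved so indentation-sensitive code keeps its block structure.
    Spaces inside string literals may also be doubled, which this sketch
    does not try to avoid.
    """
    rng = random.Random(seed)
    out_lines = []
    for line in code.split("\n"):
        indent_len = len(line) - len(line.lstrip(" \t"))
        indent, body = line[:indent_len], line[indent_len:]
        chars = []
        for ch in body:
            chars.append(ch)
            # Randomly double an existing space between tokens.
            if ch == " " and rng.random() < space_frac:
                chars.append(" ")
        out_lines.append(indent + "".join(chars))
        # Randomly add a blank line after a non-empty line.
        if line.strip() and rng.random() < newline_frac:
            out_lines.append("")
    return "\n".join(out_lines)


sample = "def add(a, b):\n    return a + b\n"
perturbed = perturb_whitespace(sample, seed=0)
```

Each perturbed copy would then be scored by the same model as the original, and the change in score used as the detection signal.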

Paper Structure

This paper contains 46 sections, 5 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Syntax element distribution of the code corpus
  • Figure 2: Comparison of Zipf's and Heaps' laws on machine- and human-authored code
  • Figure 3: Distribution of code length for machine- and human-authored code
  • Figure 4: Distribution of naturalness scores
  • Figure 5: Examples of machine- and human-authored code snippets with corresponding predictions
  • ...and 4 more figures