Distinguishing LLM-generated from Human-written Code by Contrastive Learning

Xiaodan Xu, Chao Ni, Xinrong Guo, Shaoxuan Liu, Xiaoya Wang, Kui Liu, Xiaohu Yang

TL;DR

A novel ChatGPT-generated code detector based on a contrastive learning framework and a semantic encoder built with UniXcoder is proposed, which can effectively identify ChatGPT-generated code, outperforming all selected baselines.

Abstract

Large language models (LLMs), such as ChatGPT released by OpenAI, have attracted significant attention from both industry and academia due to their demonstrated ability to generate high-quality content for various tasks. Despite these impressive capabilities, there are growing concerns about the potential risks of LLMs in fields such as news, education, and software engineering. Several commercial and open-source detectors of LLM-generated content have recently been proposed, but they are primarily designed for natural language and do not consider the specific characteristics of program code. This paper aims to fill this gap by proposing a novel ChatGPT-generated code detector, CodeGPTSensor, based on a contrastive learning framework and a semantic encoder built with UniXcoder. To assess the effectiveness of CodeGPTSensor in differentiating ChatGPT-generated code from human-written code, we first curate a large-scale Human and Machine comparison Corpus (HMCorp), which includes 550K pairs of human-written and ChatGPT-generated code (i.e., 288K Python code pairs and 222K Java code pairs). Based on the HMCorp dataset, our qualitative and quantitative analysis of the characteristics of ChatGPT-generated code reveals both the challenge and the opportunity of distinguishing it from human-written code using representative features. Our experimental results indicate that CodeGPTSensor can effectively identify ChatGPT-generated code, outperforming all selected baselines.
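
The abstract does not spell out the contrastive objective, so the following is only a rough illustration of the general idea, not CodeGPTSensor's actual loss. It uses a common margin-based pairwise contrastive loss on toy vectors that stand in for encoder outputs; the function names, the margin value, and the vectors are all assumptions made for this sketch.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_contrastive_loss(u, v, same_class, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    This is a generic formulation chosen for illustration (an assumption),
    not the loss defined in the paper.
    same_class=True  -> penalize distance, pulling the pair together.
    same_class=False -> penalize only pairs closer than `margin`,
                        pushing them apart.
    """
    d = euclidean_distance(u, v)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy 2-D vectors standing in for encoder outputs (hypothetical values).
human_a = [0.0, 0.0]
human_b = [0.3, 0.4]    # about 0.5 away from human_a
chatgpt_a = [0.6, 0.8]  # about 1.0 away from human_a

# Same-class pair: any distance is penalized.
loss_same = pairwise_contrastive_loss(human_a, human_b, same_class=True)

# Different-class pair already at the margin: essentially no penalty.
loss_diff_far = pairwise_contrastive_loss(human_a, chatgpt_a, same_class=False)

# Different-class pair that sits too close: penalized until pushed apart.
loss_diff_near = pairwise_contrastive_loss(human_a, human_b, same_class=False)

print(loss_same, loss_diff_far, loss_diff_near)
```

In a real detector the vectors would come from a code encoder such as UniXcoder and the loss would be minimized by gradient descent; here the point is only that same-class pairs are rewarded for being close and human/ChatGPT pairs for being at least a margin apart.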

Paper Structure

This paper contains 34 sections, 1 equation, 6 figures, 14 tables.

Figures (6)

  • Figure 1: An example pair of human-written and ChatGPT-generated Java functions from HMCorp (gj270412).
  • Figure 2: An example pair of human-written and ChatGPT-generated Python functions from HMCorp (gp120870).
  • Figure 3: The framework of CodeGPTSensor.
  • Figure 4: The numerical counts of cases corresponding to each observation.
  • Figure 5: Confusion Matrices of CodeGPTSensor, GPTSniffer, and gpt-4-0613 for detecting ChatGPT-generated code on the sampled Java and Python subsets.
  • ...and 1 more figure