Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features

Shinwoo Park, Hyundong Jin, Jeong-won Cha, Yo-Sub Han

TL;DR

This work introduces LPcode, a dataset of human-written code and paraphrased variants generated by four LLMs, to study whether LLM code paraphrasing can be detected and which model performed the paraphrase. It proposes LPcodedec, a coding-style-based detector that extracts 10 features covering naming, structure, and readability, assembles them into a 20-dimensional feature vector, and trains an MLP for two tasks: paraphrase detection and LLM provenance tracking. Across C, C++, Java, and Python, LPcodedec outperforms the best baselines, improving F1 score by 2.64% on paraphrase detection and 15.17% on provenance tracking, with speedups of 1,343× and 213×, respectively. The results indicate that coding-style fingerprints are informative both for detecting LLM paraphrasing and for identifying the responsible LLM, which enables applications in plagiarism detection and AI usage transparency. The authors acknowledge limitations in language and model coverage and the potential for adversarial obfuscation.

Abstract

Recent progress in large language models (LLMs) for code generation has raised serious concerns about intellectual property protection. Malicious users can exploit LLMs to produce paraphrased versions of proprietary code that closely resemble the original. While the potential for LLM-assisted code paraphrasing continues to grow, research on detecting it remains limited, underscoring an urgent need for a detection system. We respond to this need by proposing two tasks. The first task is to detect whether code generated by an LLM is a paraphrased version of original human-written code. The second task is to identify which LLM was used to paraphrase the original code. For these tasks, we construct a dataset, LPcode, consisting of pairs of human-written code and LLM-paraphrased code produced by various LLMs. We statistically confirm significant differences in the coding styles of human-written and LLM-paraphrased code, particularly in terms of naming consistency, code structure, and readability. Based on these findings, we develop LPcodedec, a detection method that identifies paraphrase relationships between human-written and LLM-generated code and discovers which LLM was used for the paraphrasing. LPcodedec outperforms the best baselines on both tasks, improving F1 scores by 2.64% and 15.17% while achieving speedups of 1,343x and 213x, respectively. Our code and data are available at https://github.com/Shinwoo-Park/detecting_llm_paraphrased_code_via_coding_style_features.
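
To make the idea of a coding-style feature vector concrete, here is a minimal sketch in Python. It is not the LPcodedec implementation: the five features below (snake_case ratio, camelCase ratio, comment-line ratio, blank-line ratio, average line length) are illustrative stand-ins for the paper's 10 naming, structure, and readability features, and the function names `style_features` and `pair_vector` are hypothetical. Concatenating the features of both code samples into one pair vector is likewise an assumption about how a per-pair input for a classifier such as an MLP could be formed.

```python
import re

def style_features(code):
    """Return a small, illustrative coding-style vector for one code sample.

    These five features are assumptions chosen for illustration only; they are
    not the feature set defined in the paper.
    """
    lines = code.splitlines()
    nonblank = [ln for ln in lines if ln.strip()]
    # Crude tokenization: every word-like token counts, keywords included.
    identifiers = re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", code)
    n_ids = max(len(identifiers), 1)
    n_lines = max(len(lines), 1)

    # Naming: share of snake_case and camelCase tokens.
    snake = sum(1 for t in identifiers if "_" in t.strip("_") and t == t.lower())
    camel = sum(1 for t in identifiers
                if re.fullmatch(r"[a-z]+(?:[A-Z][a-z0-9]*)+", t))
    # Readability: share of comment-only lines ('#' or '//' prefixes).
    comments = sum(1 for ln in nonblank if ln.lstrip().startswith(("#", "//")))
    # Structure: share of blank lines and average non-blank line length.
    blank = len(lines) - len(nonblank)
    avg_len = sum(len(ln) for ln in nonblank) / max(len(nonblank), 1)

    return [snake / n_ids, camel / n_ids, comments / n_lines,
            blank / n_lines, avg_len]

def pair_vector(original, candidate):
    """Concatenate both style vectors into one classifier input (assumed layout)."""
    return style_features(original) + style_features(candidate)

human = "def add_two(a, b):\n    # sum\n    return a + b\n"
llm = "def addTwo(firstValue, secondValue):\n\n    return firstValue + secondValue\n"
vec = pair_vector(human, llm)
```

In this sketch, `vec` has 10 entries (5 per sample). A classifier trained on such vectors could, in principle, pick up on naming and layout differences between the two samples; which features actually separate human-written from LLM-paraphrased code is an empirical question that the paper addresses with its own feature set.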

Paper Structure

This paper contains 47 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Illustration of LLM code paraphrasing detection using coding style. Humans and LLMs exhibit distinct patterns in naming, structure, and comment usage when writing code.
  • Figure 2: The confusion matrix showing the prediction results of LPcodedec.
  • Figure 3: Overview of the LPcode dataset construction process.
  • Figure 4: LPcode dataset construction process and the number of code samples at each stage.
  • Figure 5: Prompt for code paraphrase using LLMs. We utilize LLMs to generate code by incorporating human-written code ([CODE]) along with their respective programming languages ([LANG]) into this prompt template.
  • ...and 2 more figures