Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study

Yujia Fu, Peng Liang, Amjed Tahir, Zengyang Li, Mojtaba Shahin, Jiaxin Yu, Jinfu Chen

TL;DR

The paper analyzes security weaknesses in Copilot-generated code within open-source GitHub projects by collecting 733 Python and JavaScript snippets and scanning them with CodeQL, Bandit, and ESLint, mapping the findings to 43 CWE types. It finds that about 27.3% of snippets contain weaknesses, including eight CWEs from the 2023 CWE Top-25, and that the weaknesses vary by language. It then evaluates Copilot Chat as a repair tool: when given the warning messages from the static analysis tools, Copilot Chat fixes up to 55.5% of the issues, suggesting that richer context improves automated remediation. The work offers a curated dataset, CWE mappings, and practical mitigation guidance for reducing security risks when integrating AI-generated code into open-source development pipelines.

Abstract

Modern code generation tools utilizing AI models like Large Language Models (LLMs) have gained increased popularity due to their ability to produce functional code. However, their usage presents security challenges, often resulting in insecure code merging into the code base. Thus, evaluating the quality of generated code, especially its security, is crucial. While prior research explored various aspects of code generation, the focus on security has been limited, mostly examining code produced in controlled environments rather than open source development scenarios. To address this gap, we conducted an empirical study, analyzing code snippets generated by GitHub Copilot and two other AI code generation tools (i.e., CodeWhisperer and Codeium) from GitHub projects. Our analysis identified 733 snippets, revealing a high likelihood of security weaknesses, with 29.5% of Python and 24.2% of JavaScript snippets affected. These issues span 43 Common Weakness Enumeration (CWE) categories, including significant ones like CWE-330: Use of Insufficiently Random Values, CWE-94: Improper Control of Generation of Code, and CWE-79: Cross-site Scripting. Notably, eight of those CWEs are among the 2023 CWE Top-25, highlighting their severity. We further examined using Copilot Chat to fix security issues in Copilot-generated code by providing Copilot Chat with warning messages from the static analysis tools, and found that up to 55.5% of the security issues can be fixed. Finally, we provide suggestions for mitigating security issues in generated code.
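To make one of the named weaknesses concrete, here is a minimal Python sketch of the CWE-330 pattern (Use of Insufficiently Random Values) and a common remedy. It is an illustration only and is not taken from the study's dataset; the function names `weak_token` and `strong_token` are hypothetical. Bandit, one of the analyzers used in the study, reports the weak variant under its check B311.

```python
import random
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def weak_token(length: int = 32) -> str:
    """Hypothetical example of the CWE-330 pattern.

    The `random` module is a deterministic pseudo-random generator, so its
    output can be reproduced by anyone who learns or guesses its state.
    """
    return "".join(random.choice(ALPHABET) for _ in range(length))


def strong_token(length: int = 32) -> str:
    """Hypothetical fix: draw characters from the OS-backed CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    # Reseeding reproduces the "secret", which is why it is unsuitable
    # for session IDs, password-reset tokens, or API keys.
    random.seed(1234)
    first = weak_token()
    random.seed(1234)
    second = weak_token()
    print("weak tokens identical after reseeding:", first == second)
    print("strong token sample length:", len(strong_token()))
```

The two functions differ only in the source of randomness, which is typical of this weakness class: the insecure version is functionally correct and passes ordinary tests, so it tends to surface only under static analysis or security review.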

Paper Structure

This paper contains 26 sections, 17 figures, and 6 tables.

Figures (17)

  • Figure 1: Two methods of using Copilot in action
  • Figure 2: More suggestions provided by Copilot
  • Figure 3: Overview of the research process
  • Figure 4: Example of Inclusion Criterion 1: projects fully written by Copilot
  • Figure 5: Example of Inclusion Criterion 2: files with comments showing the code generated by Copilot
  • ...and 12 more figures