CodeCloak: A Method for Evaluating and Mitigating Code Leakage by LLM Code Assistants

Amit Finkman Noah, Avishag Shapira, Eden Bar Kochva, Inbar Maimon, Dudu Mimran, Yuval Elovici, Asaf Shabtai

TL;DR

This work proposes a method to mitigate the risk of code leakage when using LLM-based code assistants, and designs a method for reconstructing the developer's original codebase from code segments sent to the code assistant service during the development process.

Abstract

LLM-based code assistants are becoming increasingly popular among developers. These tools help developers improve their coding efficiency and reduce errors by providing real-time suggestions based on the developer's codebase. While beneficial, the use of these tools can inadvertently expose the developer's proprietary code to the code assistant service provider during the development process. In this work, we propose a method to mitigate the risk of code leakage when using LLM-based code assistants. CodeCloak is a novel deep reinforcement learning agent that manipulates the prompts before sending them to the code assistant service. CodeCloak aims to achieve two conflicting goals: (i) minimizing code leakage, while (ii) preserving relevant and useful suggestions for the developer. Our evaluation, which employs two LLM-based code assistant models, StarCoder and Code Llama, demonstrates CodeCloak's effectiveness on a diverse set of code repositories of varying sizes, as well as its transferability across different models. We also designed a method for reconstructing the developer's original codebase from the code segments (i.e., prompts) sent to the code assistant service during the development process, in order to thoroughly analyze code leakage risks and evaluate the effectiveness of CodeCloak under practical development scenarios.

Paper Structure

This paper contains 30 sections, 4 equations, 5 figures, 7 tables, and 1 algorithm.

Figures (5)

  • Figure 1: An example of CodeCloak's performance. On the left, we present an original prompt and the code suggested by StarCoder. On the right, we present the manipulated prompt obtained by applying CodeCloak and the code suggested by StarCoder for the manipulated prompt.
  • Figure 2: CodeCloak interactions.
  • Figure 3: An illustration of the proposed code reconstruction process along with its main components.
  • Figure 4: System and user prompts for code reconstruction using ChatGPT.
  • Figure 5: CodeCloak Action Distribution Heatmap.