CodeGenLink: A Tool to Find the Likely Origin and License of Automatically Generated Code

Daniele Bifolco, Guido Annicchiarico, Pierluigi Barbiero, Massimiliano Di Penta, Fiorella Zampetti

TL;DR

CodeGenLink addresses the provenance and licensing challenges of code generated by large language models by introducing a VS Code Copilot extension that retrieves web sources and infers licenses for generated code. It combines two operational modes with a pipeline that uses textual similarity and clone detection, supported by components for snippet extraction, license identification, and results visualization within the IDE. Preliminary evaluation on CodeSearchNet and CoderEval shows promising precision improvements with higher similarity thresholds and reasonable license detection rates, though provenance remains probabilistic. The work offers a practical workflow for developers to assess reuse constraints of generated code and points to future work in enhanced search heuristics, detection of AI-generated web content, and more extensive evaluation.

Abstract

Large Language Models (LLMs) are now widely used in software development tasks. Unlike code reused from the Web, LLM-generated code lacks provenance information, which leaves developers concerned about its trustworthiness and about possible copyright or licensing violations. This paper proposes CodeGenLink, a GitHub Copilot extension for Visual Studio Code aimed at (i) suggesting links containing code very similar to automatically generated code, and (ii) whenever possible, indicating the license of the likely origin of the code. CodeGenLink retrieves candidate links by combining LLMs with their web search features and then performs similarity analysis between the generated and retrieved code. Preliminary results show that CodeGenLink effectively filters unrelated links via similarity analysis and provides licensing information when available. Tool URL: https://github.com/danielebifolco/CodeGenLink. Tool Video: https://youtu.be/M6nqjBf9_pw
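
To make the filtering step described in the abstract concrete, here is a minimal Python sketch of threshold-based similarity filtering over candidate links. It is an illustration under stated assumptions, not CodeGenLink's implementation: the function names, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical stand-ins for whatever textual similarity and clone detection the tool actually applies.

```python
from difflib import SequenceMatcher


def normalize(code: str) -> str:
    """Collapse whitespace so formatting differences do not affect the score."""
    return " ".join(code.split())


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; an assumed stand-in for the tool's measure."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def filter_candidates(generated: str, candidates: dict, threshold: float = 0.8) -> list:
    """Keep (url, score) pairs whose snippet meets the (hypothetical) threshold.

    `candidates` maps a retrieved URL to the code snippet extracted from it.
    Results are sorted from most to least similar.
    """
    scored = [(url, similarity(generated, snippet)) for url, snippet in candidates.items()]
    kept = [(url, score) for url, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Raising `threshold` discards more loosely related links, which matches the summary's observation that higher similarity thresholds improve precision; the trade-off is that genuine but heavily edited origins may be filtered out as well.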

Paper Structure

This paper contains 6 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: CodeGenLink Architecture
  • Figure 2: Usage example of the @CodeGenLink participant