Where Is Self-admitted Code Generated by Large Language Models on GitHub?
Xiao Yu, Lei Liu, Xing Hu, Jin Liu, Xin Xia
TL;DR
This study empirically examines self-admitted LLM-generated code on GitHub and finds that ChatGPT and Copilot account for nearly all of it, mostly in small to medium-sized projects. Generated snippets are typically short, low in complexity, and contribute only a small fraction of total project LOC, with minimal post-hoc modifications and limited bug incidence. An annotation and analysis pipeline, combining a manual taxonomy with SonarQube metrics, characterizes the code types, the modification patterns, and the sparse comments surrounding LLM-generated code, most of which only state that an LLM was used. The findings inform practitioners about practical usage patterns, guide researchers toward targeted evaluation and detection benchmarks, and suggest best practices for documenting generated code within software projects.
Abstract
The increasing use of Large Language Models (LLMs) in software development has garnered significant attention from researchers evaluating the capabilities and limitations of LLMs for code generation. However, much of the research focuses on controlled datasets such as HumanEval, which do not adequately capture the characteristics of LLM-generated code in real-world development scenarios. To address this gap, our study investigates self-admitted code generated by LLMs on GitHub, specifically focusing on instances where developers in projects with over five stars acknowledge the use of LLMs to generate code through code comments. Our findings reveal several key insights: (1) ChatGPT and Copilot dominate code generation, with minimal contributions from other LLMs. (2) Projects containing ChatGPT/Copilot-generated code are typically small or medium-sized, led by small teams, and continuously evolving. (3) ChatGPT/Copilot-generated code generally makes up a minor portion of a project and consists primarily of short to moderate-length, low-complexity snippets (e.g., algorithms and data structures code; text processing code). (4) ChatGPT/Copilot-generated code generally undergoes minimal modifications, with bug-related changes ranging from 4% to 12%. (5) Most code comments only state that an LLM was used, while few include details such as prompts, human edits, or code testing status. Based on these findings, we discuss the implications for researchers and practitioners.
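
To make the notion of "self-admitted" code concrete, the following is a minimal Python sketch of how comments that credit an LLM might be flagged in a source file. It is an illustration only, not the study's pipeline: the regular expression, the list of comment prefixes, and the function name `find_self_admitted` are all assumptions introduced here, and the actual search terms used in the study are not given in this section.

```python
import re

# Hypothetical pattern: these phrases are illustrative assumptions,
# not the search terms used in the study.
SELF_ADMIT = re.compile(
    r"(?:generated|written|created|produced)\s+(?:by|with|using)\s+"
    r"(?:github\s+)?(chatgpt|copilot)",
    re.IGNORECASE,
)

# Hypothetical set of line-comment prefixes covering a few common languages.
COMMENT_PREFIXES = ("#", "//", "/*", "*", "--")


def find_self_admitted(source: str) -> list[tuple[int, str]]:
    """Return (line_number, tool) pairs for comment lines that credit an LLM."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        # Only inspect lines that look like comments; string literals
        # and ordinary code are skipped.
        if not stripped.startswith(COMMENT_PREFIXES):
            continue
        match = SELF_ADMIT.search(stripped)
        if match:
            hits.append((lineno, match.group(1).lower()))
    return hits


sample = (
    "# Generated by ChatGPT\n"
    "def f():\n"
    "    return 1\n"
    "// written with GitHub Copilot\n"
    "x = 'generated by chatgpt'\n"
)
print(find_self_admitted(sample))
```

A heuristic like this only finds candidates: a comment may mention ChatGPT or Copilot without marking generated code, so matches would still need manual review before being counted.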
