Exploring the Security Threats of Knowledge Base Poisoning in Retrieval-Augmented Code Generation

Bo Lin, Shangwen Wang, Liqian Chen, Xiaoguang Mao

TL;DR

This work studies the security risks of knowledge-base poisoning in Retrieval-Augmented Code Generation (RACG), revealing that even a single poisoned code example can introduce substantial vulnerabilities into generated code, especially for code-oriented LLMs. Through large-scale experiments across four LLMs, two retrievers, and two poisoning scenarios, the authors quantify vulnerability rates (VR) and analyze factors such as programming language, query similarity, and vulnerability type. They introduce a two-stage LLM judge to validate results and show that higher retrieval relevance (e.g., via JINA) amplifies vulnerability transfer and that higher similarity to the query also increases risk. The findings offer practical mitigation guidance, such as adjusting retrieval strategies and considering intent exposure, and lay the groundwork for securing RACG systems against knowledge-base poisoning in real-world software development.

Abstract

The integration of Large Language Models (LLMs) into software development has revolutionized the field, particularly through the use of Retrieval-Augmented Code Generation (RACG) systems that enhance code generation with information from external knowledge bases. However, the security implications of RACG systems, particularly the risks posed by vulnerable code examples in the knowledge base, remain largely unexplored. This risk is particularly concerning given that public code repositories, which often serve as the sources for knowledge base collection in RACG systems, are usually accessible to anyone in the community. Malicious attackers can exploit this accessibility to inject vulnerable code into the knowledge base, making it toxic. Once these poisoned samples are retrieved and incorporated into the generated code, they can propagate security vulnerabilities into the final product. This paper presents the first comprehensive study on the security risks associated with RACG systems, focusing on how vulnerable code in the knowledge base compromises the security of generated code. We investigate the security of LLM-generated code across different settings through extensive experiments using four major LLMs, two retrievers, and two poisoning scenarios. Our findings highlight the significant threat of knowledge base poisoning, where even a single poisoned code example can compromise up to 48% of generated code. These results provide crucial insights into how vulnerabilities are introduced in RACG systems and offer practical mitigation recommendations, thereby helping improve the security of LLM-generated code in future work.
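
To make the threat model concrete, the following is a minimal Python sketch of how a single poisoned knowledge-base entry can reach the LLM prompt. It is an illustration under stated assumptions, not the paper's implementation: the token-overlap retriever stands in for the real retrievers, the `Example` class, `retrieve`, `build_prompt`, and `vulnerability_rate` are hypothetical names, and VR is assumed here to be the fraction of generated samples flagged as vulnerable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One knowledge-base entry: a task description plus its code."""
    description: str
    code: str
    poisoned: bool = False  # hypothetical label, visible only to the experimenter


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a toy stand-in for a real retriever's score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, kb: list[Example], k: int = 1) -> list[Example]:
    """Return the k entries whose descriptions are most similar to the query."""
    ranked = sorted(kb, key=lambda ex: jaccard(query, ex.description), reverse=True)
    return ranked[:k]


def build_prompt(query: str, retrieved: list[Example]) -> str:
    """Place retrieved code in front of the task, as a RACG prompt would."""
    context = "\n\n".join(
        f"# Example: {ex.description}\n{ex.code}" for ex in retrieved
    )
    return f"{context}\n\n# Task: {query}\n"


def vulnerability_rate(flags: list[bool]) -> float:
    """Assumed VR definition: share of generated samples judged vulnerable."""
    return sum(flags) / len(flags) if flags else 0.0


# A two-entry knowledge base; the second entry is the injected, vulnerable one.
kb = [
    Example("read a file and print its contents",
            "with open(path) as f:\n    print(f.read())"),
    Example("run a shell command from user input",
            "import os\nos.system(user_input)  # command injection",
            poisoned=True),
]

query = "run a shell command given by the user"
prompt = build_prompt(query, retrieve(query, kb, k=1))
```

In this toy setup the poisoned entry is the closest match to the query, so its `os.system` call ends up in the prompt context. This mirrors the observation that examples more similar to the query carry more risk; whether the LLM then reproduces the flaw is what the vulnerability rate is meant to measure.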

Paper Structure

This paper contains 44 sections, 6 equations, 1 figure, and 12 tables.

Figures (1)

  • Figure 1: A typical workflow of the RACG system.