RAG-Pull: Imperceptible Attacks on RAG Systems for Code Generation

Vasilije Stambolic, Aritra Dhar, Lukas Cavigelli

TL;DR

RAG-Pull investigates a new class of imperceptible attacks on retrieval-augmented code-generation systems: invisible Unicode characters are inserted into queries and/or code targets to bias embedding-based retrieval toward attacker-controlled snippets. The attack operates in a fully black-box setting, using differential evolution to maximize the embedding similarity $s(E(\delta_q), E(\delta_t))$ so that the adversarial target appears in the top-$K$ retrieved documents and is then reproduced by the LLM. The authors formalize three attack scenarios (perturbing the query, perturbing the target, perturbing both), evaluate across three datasets and two languages, and report retrieval success rates approaching 100% along with end-to-end vulnerability reproduction in many cases, highlighting a real risk to safety alignment in RAG systems. They also analyze defense strategies (e.g., sanitizing invisible characters or Unicode-aware tokenization) and emphasize how fragile embedding-based retrieval is under small, imperceptible perturbations, motivating Unicode-aware defenses and further study of RAG robustness in code-generation settings.
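As a rough illustration of the "sanitizing invisible characters" defense mentioned above, the following Python sketch strips characters that typically render as nothing before a query or document reaches the embedding model. It is not the authors' implementation: the function names (`is_invisible`, `sanitize`, `count_invisible`) are hypothetical, and the choice of character classes (Unicode category `Cf` plus the variation-selector ranges) is an assumption about which characters an attacker might use, not a list taken from the paper.

```python
import unicodedata


def is_invisible(ch: str) -> bool:
    """Return True for characters that usually render as nothing (assumed set)."""
    cp = ord(ch)
    # Category Cf (format) covers zero-width space/joiners, BOM, soft hyphen,
    # and the Unicode tag characters.
    if unicodedata.category(ch) == "Cf":
        return True
    # Variation selectors are category Mn, so they need an explicit range check.
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def sanitize(text: str) -> str:
    """Drop every character flagged by is_invisible."""
    return "".join(ch for ch in text if not is_invisible(ch))


def count_invisible(text: str) -> int:
    """Count flagged characters; a nonzero count may be worth logging."""
    return sum(1 for ch in text if is_invisible(ch))


if __name__ == "__main__":
    # Hypothetical query with three hidden characters inserted.
    query = "read a file\u200b in\u200d Python\ufe0f"
    print(count_invisible(query))
    print(repr(sanitize(query)))
```

Such a filter would need to run on both the query path and the indexing path, since the attack can perturb either side. Note that removing all format characters is a blunt choice: zero-width joiners are legitimate in some scripts and emoji sequences, so a production filter would likely need an allow-list.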

Abstract

Retrieval-Augmented Generation (RAG) increases the reliability and trustworthiness of the LLM response and reduces hallucination by eliminating the need for model retraining. It does so by adding external data into the LLM's context. We develop a new class of black-box attack, RAG-Pull, that inserts hidden UTF characters into queries or external code repositories, redirecting retrieval toward malicious code, thereby breaking the models' safety alignment. We observe that query and code perturbations alone can shift retrieval toward attacker-controlled snippets, while combined query-and-target perturbations achieve near-perfect success. Once retrieved, these snippets introduce exploitable vulnerabilities such as remote code execution and SQL injection. RAG-Pull's minimal perturbations can alter the model's safety alignment and increase preference towards unsafe code, therefore opening up a new class of attacks on LLMs.
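To make the retrieval step concrete, here is a minimal Python sketch of top-k retrieval by cosine similarity, the kind of ranking that the attack tries to redirect. Everything in it is an assumption for illustration: the two-dimensional vectors are made up rather than produced by a real embedding model, the snippet names are hypothetical, and the "perturbed" query vector simply stands in for whatever embedding a query with hidden characters would receive. It does not implement the attack's search procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus, k):
    """Return the names of the k corpus entries most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda name: cosine(query_vec, corpus[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (made up); a real system would call an embedding model.
corpus = {
    "safe_snippet_a": [1.0, 0.0],
    "safe_snippet_b": [0.8, 0.6],
    "adversarial_target": [0.0, 1.0],
}
clean_query = [1.0, 0.1]      # stands in for the unperturbed query embedding
perturbed_query = [0.1, 1.0]  # stands in for the embedding after perturbation

if __name__ == "__main__":
    print(top_k(clean_query, corpus, 1))
    print(top_k(perturbed_query, corpus, 1))
```

In this toy setup the clean query ranks a safe snippet first, while the shifted query ranks the adversarial target first. The attack counts as a retrieval success once the target lands anywhere in the top k results.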

Paper Structure

This paper contains 37 sections, 26 figures, 4 tables.

Figures (26)

  • Figure 1: A high-level overview of the RAG-Pull attack targeting a code-generation inference serving system (e.g., Copilot). The system comprises prompt-engineering tools that augment the user prompt for better efficacy, a retriever model that searches code repositories and webpages for relevant code, and a code-optimized LLM that produces the final response code.
  • Figure 2: t-SNE visualization of embeddings in the Perturbing the Query case. Each subplot shows the query position after applying different levels of perturbations (none, 10%, 25%, and 50% of query length). As the perturbations increase, the query embedding shifts closer to the target, eventually making the target one of the top-3 nearest neighbors.
  • Figure 4: Post-Retrieval Generation Success for queries where the adversarial target was successfully retrieved in the top-$k$. Rows group datasets and targets, columns group strategies (Perturbing the Query, Perturbing the Target, Perturbing Both) and $k \in \{1,3,5\}$. Each pie shows the share of cases where the target was present in the output, target absent, or code not extractable from the LLM response. Missing results (no successful retrieval) are shown as light-gray pies.
  • Figure 5: Vulnerability analysis of generated code from the Python Alpaca dataset (Python Alpaca target listing) under the Perturbing the Query scenario. Bars show the total number of low-, medium-, and high-severity vulnerabilities detected by Bandit, comparing compromised RAG against the baseline settings for $k \in \{1,3,5\}$.
  • Figure 6: Distribution of codes containing vulnerabilities in Python Alpaca outputs (Python Alpaca target listing) under the Perturbing the Query scenario.
  • ...and 21 more figures