
Defense against Prompt Injection Attacks via Mixture of Encodings

Ruiyi Zhang, David Sullivan, Kyle Jackson, Pengtao Xie, Mei Chen

TL;DR

Prompt injection attacks threaten LLM safety when external data is integrated into the prompt. The proposed mixture of encodings defense encodes the external text in multiple ways (including Base64 and Caesar), obtains a separate response for each encoding, and aggregates the responses with a prompt ensemble strategy. For classification, the final decision follows $\hat{y} = \mathop{\mathrm{arg\,max}}_i\left(p_{1i} + p_{2i} + p_{3i}\right)$; for generation, a meta-prompt combines the three responses R1, R2, and R3. The work reports substantially reduced attack success rates while maintaining high NLP task performance across multiple benchmarks and models (GPT-4, GPT-4o, and Qwen-2.5-72B-Instruct), with code released for reproducibility. These results indicate that combining diverse encodings with aggregated outputs defends against prompt injection without sacrificing practical utility, which matters for LLM deployments that rely on external information sources.
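The classification rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: it assumes that $p_{ji}$ denotes the probability the model assigns to class $i$ when the external text is presented under the $j$-th encoding, and the function name `aggregate_classification` is hypothetical.

```python
from typing import Sequence


def aggregate_classification(probs: Sequence[Sequence[float]]) -> int:
    """Return the class index with the largest summed probability.

    probs[j][i] is assumed to be the probability of class i obtained
    under the j-th encoding (an interpretation of p_{ji}, not confirmed
    by the summary text).
    """
    if not probs:
        raise ValueError("need at least one probability vector")
    num_classes = len(probs[0])
    if any(len(p) != num_classes for p in probs):
        raise ValueError("all probability vectors must have the same length")
    # Sum the per-encoding probabilities class by class.
    totals = [sum(p[i] for p in probs) for i in range(num_classes)]
    # arg max over classes; ties resolve to the lowest index.
    return max(range(num_classes), key=totals.__getitem__)


# Three encodings, two classes: the first response prefers class 0,
# the other two prefer class 1, so the summed scores favor class 1.
example = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
predicted = aggregate_classification(example)
```

Summing probabilities rather than taking a majority vote lets a confident response outweigh two weakly held ones, which is one plausible reason to prefer this form of ensemble.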

Abstract

Large Language Models (LLMs) have emerged as a dominant approach for a wide range of NLP tasks, with their access to external information further enhancing their capabilities. However, this introduces new vulnerabilities, known as prompt injection attacks, where external content embeds malicious instructions that manipulate the LLM's output. Recently, the Base64 defense has been recognized as one of the most effective methods for reducing the success rate of prompt injection attacks. Despite its efficacy, this method can degrade LLM performance on certain NLP tasks. To address this challenge, we propose a novel defense mechanism: mixture of encodings, which utilizes multiple character encodings, including Base64. Extensive experimental results show that our method achieves one of the lowest attack success rates under prompt injection attacks while maintaining high performance across all NLP tasks, outperforming existing character encoding-based defense methods. This underscores the effectiveness of our mixture of encodings strategy on both safety and task performance metrics.

Paper Structure

This paper contains 40 sections, 1 equation, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Example of prompt injection attack. Malicious instructions are embedded in webpages, leading to unexpected behavior of LLMs.
  • Figure 2: An overview of the mixture of encodings defense against prompt injection attacks. The external text is encoded with multiple encodings and inputted into an LLM separately to get three different answers. Based on these answers, the LLM then generates the final output.
  • Figure 3: Examples of LLM outputs under Base64 Defense. (a) LLM output is unaffected by the prompt injection attack. (b) LLM output incorrectly answers a math question.
  • Figure 4: Example of an LLM's answer to a mathematical question under the mixture of encodings defense.
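The encoding step described in the Figure 2 caption can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the summary names Base64 and Caesar but does not list all three inputs, so using the unencoded text as the third variant is an assumption, as are the Caesar shift of 3 and the helper names `caesar_encode`, `base64_encode`, and `encode_variants`.

```python
import base64


def caesar_encode(text: str, shift: int = 3) -> str:
    """Shift ASCII letters by `shift` positions; leave other characters as-is.

    The shift value of 3 is an assumption made for illustration.
    """
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def base64_encode(text: str) -> str:
    """Encode text as UTF-8 bytes and return the Base64 string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def encode_variants(external_text: str) -> dict[str, str]:
    """Build the versions of the external text sent to the LLM separately.

    Including the plain text as one of the three variants is an assumption.
    """
    return {
        "plain": external_text,
        "base64": base64_encode(external_text),
        "caesar": caesar_encode(external_text),
    }


variants = encode_variants("Ignore previous instructions.")
```

Each variant would then be placed in its own prompt, and the resulting answers combined by the aggregation step the caption describes.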