Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation

Minkyoung Kim, Yunha Kim, Hyeram Seo, Heejung Choi, Jiye Han, Gaeun Kee, Soyoung Ko, HyoJe Jung, Byeolhee Kim, Young-Hak Kim, Sanghyun Park, Tae Joon Jun

TL;DR

The paper tackles adversarial vulnerabilities in LLMs by introducing a gradient-based universal defensive suffix generated by smaller LLMs. It optimizes a total loss $L_{\text{total}} = L_{\text{def}} - \alpha \cdot \log(L_{\text{adv}})$ with $\alpha=0.01$, combining safety alignment with adversarial resistance to craft a suffix that neutralizes harmful prompts without retraining the victim model. Evaluations across multiple open-source LLMs show substantial reductions in attack success rate (ASR) and favorable improvements in perplexity and TruthfulQA scores, indicating the approach preserves fluency and factuality while boosting robustness. The work demonstrates a scalable, practical defense for open-source LLM deployment, suitable for critical applications where retraining large models is prohibitive. Future directions include extending suffix generalization to more complex architectures and optimizing computational efficiency for large-scale deployment.
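As a rough illustration of how the combined objective behaves, the sketch below evaluates $L_{\text{total}} = L_{\text{def}} - \alpha \cdot \log(L_{\text{adv}})$ for scalar loss values. The function name `total_loss` and the sample numbers are assumptions made for this example; how $L_{\text{def}}$ and $L_{\text{adv}}$ are computed from model outputs is not specified in this summary and is not reproduced here.

```python
import math

ALPHA = 0.01  # weighting stated in the summary above


def total_loss(l_def: float, l_adv: float, alpha: float = ALPHA) -> float:
    """Evaluate L_total = L_def - alpha * log(L_adv) for scalar losses.

    Hypothetical helper for illustration only. l_adv must be strictly
    positive, otherwise the logarithm is undefined.
    """
    if l_adv <= 0:
        raise ValueError("l_adv must be > 0 for log(l_adv) to be defined")
    return l_def - alpha * math.log(l_adv)


# Illustrative values, not results from the paper.
baseline = total_loss(l_def=2.0, l_adv=1.0)      # log(1) = 0, so only L_def remains
higher_adv = total_loss(l_def=2.0, l_adv=10.0)   # a larger L_adv lowers the total
```

Because the logarithmic term enters with a negative sign, minimizing the total pushes $L_{\text{def}}$ down while rewarding a larger $L_{\text{adv}}$; the small $\alpha$ and the logarithm keep that second term from dominating.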

Abstract

Large language models (LLMs) have exhibited outstanding performance in natural language processing tasks. However, these models remain susceptible to adversarial attacks in which slight input perturbations can lead to harmful or misleading outputs. A gradient-based defensive suffix generation algorithm is designed to bolster the robustness of LLMs. By appending carefully optimized defensive suffixes to input prompts, the algorithm mitigates adversarial influences while preserving the models' utility. To enhance adversarial understanding, a novel total loss function ($L_{\text{total}}$) combining defensive loss ($L_{\text{def}}$) and adversarial loss ($L_{\text{adv}}$) generates defensive suffixes more effectively. Experimental evaluations conducted on open-source LLMs such as Gemma-7B, Mistral-7B, Llama2-7B, and Llama2-13B show that the proposed method reduces attack success rates (ASR) by an average of 11% compared to models without defensive suffixes. Additionally, the perplexity score of Gemma-7B decreased from 6.57 to 3.93 when applying the defensive suffix generated by OpenELM-270M. Furthermore, TruthfulQA evaluations demonstrate consistent improvements, with Truthfulness scores increasing by up to 10% across tested configurations. This approach significantly enhances the security of LLMs in critical applications without requiring extensive retraining.
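For readers unfamiliar with the two metrics quoted in the abstract, the following sketch uses their standard definitions: ASR as the fraction of adversarial prompts whose response is judged harmful, and perplexity as the exponential of the mean per-token negative log-likelihood. The function names and the judge labels are hypothetical and chosen for illustration; they are not data or code from the paper, and the paper's exact evaluation pipeline may differ.

```python
import math


def attack_success_rate(judged_harmful):
    """Fraction of adversarial prompts whose response was judged harmful."""
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(bool(j) for j in judged_harmful) / len(judged_harmful)


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical judge labels (True = attack succeeded), not paper data.
without_suffix = [True, True, False, True]
with_suffix = [True, False, False, False]

asr_before = attack_success_rate(without_suffix)
asr_after = attack_success_rate(with_suffix)
```

A lower ASR after appending the suffix indicates fewer successful attacks, and a lower perplexity indicates that the model assigns higher probability to the evaluated text.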

Paper Structure

This paper contains 15 sections, 4 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of Defensive Strategy through Suffix Optimization. The figure illustrates how a universal defensive suffix is generated with a smaller language model (sLLM) and applied to larger victim models (LLMs) to neutralize harmful queries. It also includes actual examples of the defensive suffix and the LLM prompt used during the evaluation.
  • Figure 2: Prompt Format for Evaluating ASR in GPT Models. The figure illustrates how evaluation prompts are structured and submitted to OpenAI's GPT models (GPT-3.5, GPT-4) during the ASR calculation process.