Adversarial Attacks on Large Language Models Using Regularized Relaxation

Samuel Jacob Chacko, Sajib Biswas, Chashi Mahiul Islam, Fatema Tabassum Liza, Xiuwen Liu

TL;DR

This paper proposes an adversarial attack technique that combines regularized gradients with continuous optimization. Unlike existing continuous optimization methods, it generates valid tokens from the model's vocabulary.

Abstract

As powerful Large Language Models (LLMs) are now widely used for numerous practical applications, their safety is of critical importance. While alignment techniques have significantly improved overall safety, LLMs remain vulnerable to carefully crafted adversarial inputs. Consequently, adversarial attack methods are extensively used to study and understand these vulnerabilities. However, current attack methods face significant limitations. Those relying on optimizing discrete tokens suffer from limited efficiency, while continuous optimization techniques fail to generate valid tokens from the model's vocabulary, rendering them impractical for real-world applications. In this paper, we propose a novel technique for adversarial attacks that overcomes these limitations by leveraging regularized gradients with continuous optimization methods. Our approach is two orders of magnitude faster than the state-of-the-art greedy coordinate gradient-based method and significantly improves the attack success rate on aligned language models. Moreover, it generates valid tokens, addressing a fundamental limitation of existing continuous optimization methods. We demonstrate the effectiveness of our attack on five state-of-the-art LLMs using four datasets.
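The gap between a continuous embedding and a valid vocabulary token can be illustrated with a toy example. The sketch below is not the paper's algorithm, whose details are not given in this section. It only assumes a generic setup: gradient descent on a stand-in quadratic loss with an L2 penalty, followed by nearest-neighbour projection onto a small random "vocabulary". The names `optimize_embedding`, `nearest_token_ids`, `lam`, and all sizes are hypothetical.

```python
import numpy as np

def optimize_embedding(target, lam, steps=200, lr=0.1):
    """Gradient descent on the toy loss ||x - target||^2 + lam * ||x||^2."""
    x = np.zeros_like(target)
    for _ in range(steps):
        # Gradient of the data term plus gradient of the L2 penalty.
        grad = 2.0 * (x - target) + 2.0 * lam * x
        x = x - lr * grad
    return x

def nearest_token_ids(x, vocab_embeddings):
    """Map each row of x to the index of the closest vocabulary embedding."""
    # Pairwise squared Euclidean distances, shape (seq_len, vocab_size).
    d2 = ((x[:, None, :] - vocab_embeddings[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
# Hypothetical toy vocabulary: 50 tokens, 8 dimensions (assumption, not real model data).
vocab_embeddings = rng.normal(scale=0.02, size=(50, 8))
# Stand-in for the point an unconstrained continuous optimizer would reach.
target = rng.normal(size=(4, 8))

x_free = optimize_embedding(target, lam=0.0)  # no regularization
x_reg = optimize_embedding(target, lam=1.0)   # L2-regularized
token_ids = nearest_token_ids(x_reg, vocab_embeddings)
```

In this toy setting the penalty pulls the optimized vectors toward the origin (here to `target / 2`), and the projection step returns indices that always correspond to rows of the vocabulary matrix, which is what "valid tokens" means in the abstract.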

Paper Structure

This paper contains 24 sections, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of Regularized Relaxation, introducing the regularization term, optimization process, and adversarial suffix generation.
  • Figure 2: A plot of the average token embedding over the $32000$ tokens of the Llama2-7B-chat model, each with $4096$ dimensions.
  • Figure 3: Runtime (log scale) of our method compared to four baseline attack techniques, averaged over all models and datasets.
  • Figure 4: Entailment prompt used to evaluate generated output.
  • Figure 5: Beaver-cost prompt used to evaluate generated output.
  • ...and 3 more figures
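One plausible reading of the quantity in Figure 2 is the per-dimension mean of the embedding matrix taken over all tokens, which yields a single vector with one entry per embedding dimension. The sketch below assumes that reading and uses a small random matrix as a stand-in. The function name `average_token_embedding` and the toy sizes are hypothetical; the matrix described in the caption would have shape (32000, 4096).

```python
import numpy as np

def average_token_embedding(embeddings):
    """Return the mean embedding over the vocabulary axis.

    embeddings: array of shape (vocab_size, dim).
    Result: array of shape (dim,), one average value per dimension.
    """
    return embeddings.mean(axis=0)

rng = np.random.default_rng(0)
# Toy stand-in matrix (assumption): 320 tokens x 64 dimensions instead of 32000 x 4096.
toy_embeddings = rng.normal(scale=0.02, size=(320, 64))

avg = average_token_embedding(toy_embeddings)
```

Plotting `avg` against the dimension index would give a figure of the kind the caption describes, under the assumption stated above.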