TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models

Zelin Li, Kehai Chen, Lemao Liu, Xuefeng Bai, Mingming Yang, Yang Xiang, Min Zhang

TL;DR

This paper introduces a new scheme, named TF-Attack, for Transferable and Fast adversarial attacks on LLMs, which consistently surpasses previous methods in transferability and delivers significant speed improvements, up to 20 times faster than earlier attack strategies.

Abstract

With the rapid advancement of large language models (LLMs), adversarial attacks against LLMs have recently attracted increasing attention. We find that existing adversarial attack methods exhibit limited transferability and are notably inefficient, particularly when applied to LLMs. In this paper, we analyze the core mechanisms of the predominant prior attack methods, revealing that 1) the distributions of importance scores differ markedly among victim models, restricting transferability; and 2) the sequential attack process induces substantial time overhead. Based on these two insights, we introduce a new scheme, named TF-Attack, for Transferable and Fast adversarial attacks on LLMs. TF-Attack employs an external LLM as a third-party overseer, rather than the victim model, to identify critical units within sentences. Moreover, TF-Attack introduces the concept of Importance Level, which allows substitutions to be performed in parallel. We conduct extensive experiments on 6 widely adopted benchmarks, evaluating the proposed method with both automatic and human metrics. Results show that our method consistently surpasses previous methods in transferability and delivers significant speed improvements, up to 20 times faster than earlier attack strategies.
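
To make the Importance Level idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the external LLM has already assigned each word a level from 1 to 5; the `levels` list, the `substitutes` table, and both helper functions are hypothetical stand-ins. It shows how words sharing a level can be replaced in one pass, so the number of substitution rounds depends on the number of levels rather than the number of words.

```python
from collections import defaultdict


def group_by_level(levels):
    """Map each importance level to the word positions assigned to it."""
    groups = defaultdict(list)
    for position, level in enumerate(levels):
        groups[level].append(position)
    return dict(groups)


def parallel_substitute(words, levels, substitutes, level):
    """Replace every word at the given level in a single pass."""
    return [
        substitutes.get(word, word) if word_level == level else word
        for word, word_level in zip(words, levels)
    ]


# Toy sentence. The levels are hypothetical; in the paper's setting they
# would come from an external LLM rather than from the victim model.
words = ["the", "movie", "was", "absolutely", "wonderful"]
levels = [1, 4, 1, 5, 5]
# Hypothetical substitute table (one candidate per word for brevity).
substitutes = {"movie": "film", "absolutely": "utterly", "wonderful": "marvelous"}

groups = group_by_level(levels)
sequential_rounds = len(words)   # one round per word
parallel_rounds = len(groups)    # one round per importance level
disturbed = parallel_substitute(words, levels, substitutes, level=5)

print(groups)
print(sequential_rounds, parallel_rounds)
print(" ".join(disturbed))
```

In this toy case the five words fall into three levels, so a level-by-level pass needs three rounds instead of five. The actual speedup reported in the paper depends on details not reproduced here.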

Paper Structure (28 sections, 2 equations, 6 figures, 15 tables)

Figures (6)

  • Figure 1: Importance score distribution of the same sentence given by BERT-Attack on BERT and LLaMA.
  • Figure 2: Importance score distribution of the same sentence given by BERT-Attack on WordCNN and WordLSTM.
  • Figure 3: Time cost of each module from BERT-Attack on SA-LLaMA and BERT.
  • Figure 4: Step 1: ChatGPT categorizes words into 5 Importance Levels with varying word counts. The Inverted Pyramid Searching Space reflects how the list of Substitute Candidates shrinks as the level decreases. Step 2: Words from the same level are selected and replaced through Parallel Substitutions to generate a Disturbed Input. Possible Disturbed Inputs are explored via SA-LLaMA, and the result surpassing the threshold is chosen as the Generated Sample from Confirmed Substitutions. Substitution Iterations end when the stopping condition is met. Step 3: Applying Multi-Disturb and Dynamic Disturb produces Transferable Samples.
  • Figure 5: The time cost according to varying sentence lengths in the IMDB dataset. The left is on LLaMA while the right is on BERT.
  • ...and 1 more figure
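
The Inverted Pyramid Searching Space in the Figure 4 caption can be illustrated with a small sketch. The caption only states that the candidate list gets shorter as the level decreases; the linear schedule, the `step` parameter, and the function names below are assumptions made for illustration, not values from the paper.

```python
def candidate_budget(level, step=10):
    """Hypothetical schedule: a word at level k may try k * step substitutes."""
    return level * step


def trim_candidates(candidates, level, step=10):
    """Keep only as many candidates as the word's level allows."""
    return candidates[:candidate_budget(level, step)]


# Budgets from the most important level (5) down to the least important (1).
budgets = [candidate_budget(level) for level in range(5, 0, -1)]
print(budgets)

# A level-2 word keeps only the first 20 of 100 ranked candidates.
kept = trim_candidates(list(range(100)), level=2)
print(len(kept))
```

Under this assumed schedule, more of the search effort goes to the words judged most important, while low-level words are tried against only a short candidate list.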