Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models

Hongfu Liu; Yuxi Xie; Ye Wang; Michael Shieh

TL;DR

This work proposes DeGCG, a two-stage transfer learning framework that decouples the suffix search into behavior-agnostic pre-searching and behavior-relevant post-searching. It also introduces an interleaved variant, i-DeGCG, which iteratively leverages self-transferability to accelerate the search.

Abstract

Large Language Models (LLMs) face safety concerns due to potential misuse by malicious users. Recent red-teaming efforts have identified adversarial suffixes capable of jailbreaking LLMs using the gradient-based search algorithm Greedy Coordinate Gradient (GCG). However, GCG is computationally inefficient, which limits further investigation of suffix transferability and scalability across models and data. In this work, we connect search efficiency with suffix transferability. We propose a two-stage transfer learning framework, DeGCG, which decouples the search process into behavior-agnostic pre-searching and behavior-relevant post-searching. Specifically, we employ direct first target token optimization in pre-searching to facilitate the search process. We apply our approach to cross-model, cross-data, and self-transfer scenarios. Furthermore, we introduce an interleaved variant of our approach, i-DeGCG, which iteratively leverages self-transferability to accelerate the search process. Experiments on HarmBench demonstrate the efficiency of our approach across various models and domains. Notably, our i-DeGCG outperforms the baseline on Llama2-chat-7b with ASRs of $43.9$ ($+22.2$) and $39.0$ ($+19.5$) on the validation and test sets, respectively. Further analysis on cross-model transfer indicates the pivotal role of first target token optimization in leveraging suffix transferability for efficient searching.
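The contrast the abstract draws between first target token optimization and optimizing the whole target sequence can be illustrated with a toy loss computation. The sketch below is not the authors' implementation: the function names, the use of a mean over positions for the full-sequence objective, and the random toy logits are all assumptions made for illustration. It only computes per-position cross-entropy from a matrix of logits and does not perform any search.

```python
import numpy as np

def per_position_cross_entropy(logits, target_ids):
    """Cross-entropy of each target token, given logits of shape (T, V).

    Hypothetical helper: T target positions, V vocabulary entries.
    """
    logits = np.asarray(logits, dtype=float)
    target_ids = np.asarray(target_ids, dtype=int)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the target token at each position.
    return -log_probs[np.arange(len(target_ids)), target_ids]

def first_token_loss(logits, target_ids):
    """Loss on the first target token only (assumed pre-searching objective)."""
    return float(per_position_cross_entropy(logits, target_ids)[0])

def full_sequence_loss(logits, target_ids):
    """Mean loss over all target tokens (assumed full-sequence objective)."""
    return float(per_position_cross_entropy(logits, target_ids).mean())

# Toy data: 4 target positions over a 10-token vocabulary.
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(4, 10))
toy_targets = np.array([3, 1, 7, 2])

print("per-position:", per_position_cross_entropy(toy_logits, toy_targets))
print("first token :", first_token_loss(toy_logits, toy_targets))
print("full target :", full_sequence_loss(toy_logits, toy_targets))
```

Tracking the per-position values separately, rather than only their mean, is what makes a comparison of loss dynamics across target positions (as in Figure 1) possible.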

Paper Structure (27 sections, 3 equations, 5 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Safety-Aligned LLMs
Jailbreak Attacks on Aligned LLMs
Method
Preliminary
DeGCG
First-Token Searching
Context-Aware Searching
Interleaved Self-Transfer
Experiments
Setup
Datasets.
Implementation Details.
Main Results
...and 12 more sections

Figures (5)

Figure 1: GCG Training Dynamics of Cross Entropy Loss for tokens located at different positions in the target sequence. We plot the changes in cross-entropy loss of target tokens at positions [1, 2, 4, 8] every 100 steps. This discrepancy in loss dynamics highlights the importance of first token optimization in GCG.
Figure 2: Our DeGCG framework involves two main stages. In the pre-searching stage, we perform first-token searching (FTS) with LLM A on Behavior Set A. In the post-searching/fine-tuning stage, we perform content-aware searching (CAS) with LLM B on Behavior Set B. The Suffix-FTS obtained in pre-searching serves as the initialization for post-searching. Cross-Data Transfer uses the same LLM but distinct sets, while Cross-Model Transfer uses the same set but distinct LLMs. For Interleaved Self-Transfer, we use the same LLM and set but alternate between FTS and CAS.
Figure 3: Performance comparison (ASR) in Cross-Data Transferring across different behavior types in HarmBench. We report the results of Llama2-chat-7b on both the Validation and the Test sets.
Figure 4: Training dynamics (cross-entropy loss) comparison for GCG-M, DeGCG, and i-DeGCG.
Figure 5: Performance comparison (ASR) in Cross-Data Transferring across different behavior types in HarmBench. We report the results of OpenChat-3.5-7b on both the Validation and the Test sets.
