PAL: Proxy-Guided Black-Box Attack on Large Language Models

Chawin Sitawarin; Norman Mu; David Wagner; Alexandre Araujo

TL;DR

The paper studies how vulnerable deployed LLM APIs are to jailbreaks that elicit harmful output, and introduces PAL, a black-box, token-level attack guided by an open-source proxy model. PAL reduces the number of queries to the target model by using gradients from the proxy model and a loss designed for real-world LLM APIs. It is complemented by RAL, a simple random-search baseline for query-based attacks, and GCG++, an improved white-box attack. PAL reaches an 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, and GCG++ reaches 94% ASR on white-box Llama-2-7B. The paper also details how to extract losses from restricted APIs (logit-bias tricks and prefix-based heuristics). The findings underscore the need for robust defenses and support more comprehensive safety testing of LLMs deployed behind APIs.

Abstract

Large Language Models (LLMs) have surged in popularity in recent months, but they have demonstrated concerning capabilities to generate harmful content when manipulated. While techniques like safety fine-tuning aim to minimize harmful use, recent works have shown that LLMs remain vulnerable to attacks that elicit toxic responses. In this work, we introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting. In particular, it relies on a surrogate model to guide the optimization and a sophisticated loss designed for real-world LLM APIs. Our attack achieves 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art. We also propose GCG++, an improvement to the GCG attack that reaches 94% ASR on white-box Llama-2-7B, and the Random-Search Attack on LLMs (RAL), a strong but simple baseline for query-based attacks. We believe the techniques proposed in this work will enable more comprehensive safety testing of LLMs and, in the long term, the development of better security guardrails. The code can be found at https://github.com/chawins/pal.

Paper Structure (26 sections, 7 equations, 9 figures, 9 tables)

Introduction
Background and Related Work
Black-Box Attacks on LLM APIs
Overview
PAL: Proxy-guided Attack on LLMs
Computing Loss from LLM API
Other Algorithm Improvements
GCG++ and RAL Attacks
Experiment
Setup
Black-Box Attacks
White-Box Attacks
Discussion
Conclusion
Reproducibility
...and 11 more sections

Figures (9)

Figure 1: Our Proxy-Guided Attack on LLMs (PAL) is a query-based jailbreaking algorithm against black-box LLM APIs. It uses token-level optimization guided by an open-source proxy model. It outperforms state-of-the-art red-teaming LLMs at a lower cost.
Figure 2: Illustration of our candidate-ranking heuristic. In this example, we compare four candidates with the target string "Sure, here is". Logprobs are shown as numbers above each generated token. We use the cross-entropy (aka negative log-likelihood, NLL) loss, which sums the negative logprob of each target token. Candidates 1 and 4 are dropped as soon as they fail to produce the target token, i.e., we do not query the grayed-out tokens. They spend only three queries and one query, respectively, and their loss is set to infinity.
Figure 3: Examples of prefixes from successful jailbreaks against GPT-3.5-Turbo-1106 that do not follow the target string exactly.
Figure 4: ASRs of the PAL attack with and without fine-tuning against GPT-3.5-Turbo.
Figure 5: $\mathrm{ASR}_{\mathrm{S}}$ and loss vs attack steps on Llama-2-7B.
...and 4 more figures
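
The candidate-ranking heuristic described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the `oracle` callable is a hypothetical stand-in for one API query that returns the logprob of the next target token (or `None` when the candidate fails to produce it), the candidate names are placeholders, and the logprob values in `TOY` are invented for illustration. The sketch only assumes what the caption states: one query per target token, an NLL loss summed over target tokens, and early dropping with the loss set to infinity.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical oracle: one call stands for one API query. It returns the
# logprob of the i-th target token for a candidate, or None when the
# candidate does not produce that token.
Oracle = Callable[[str, int], Optional[float]]


def rank_candidates(
    candidates: Sequence[str], target_len: int, oracle: Oracle
) -> List[Tuple[str, float, int]]:
    """Return (candidate, NLL loss, queries spent), sorted by loss."""
    results = []
    for cand in candidates:
        loss, queries = 0.0, 0
        for i in range(target_len):
            queries += 1
            logprob = oracle(cand, i)
            if logprob is None:
                # Drop the candidate: skip its remaining target tokens
                # and set its loss to infinity.
                loss = math.inf
                break
            loss -= logprob  # NLL sums the negative logprobs
        results.append((cand, loss, queries))
    # sorted() is stable, so dropped candidates keep their input order.
    return sorted(results, key=lambda r: r[1])


# Invented logprobs for a 4-token target such as "Sure, here is".
TOY = {
    "cand1": [-0.5, -0.2, None],         # fails on the third token
    "cand2": [-0.9, -0.1, -0.3, -0.2],
    "cand3": [-0.4, -0.1, -0.2, -0.1],
    "cand4": [None],                     # fails on the first token
}


def toy_oracle(cand: str, i: int) -> Optional[float]:
    return TOY[cand][i]


if __name__ == "__main__":
    for cand, loss, queries in rank_candidates(list(TOY), 4, toy_oracle):
        print(cand, loss, queries)
```

In this toy run, the two dropped candidates spend three queries and one query, matching the counts in the caption, while the surviving candidates spend four queries each and are ordered by their summed NLL.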
