LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model

Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier, Ankit Shah, Joseph Konan, Dareen Alharthi, Hazim T Bukhari, Massa Baali, Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh

TL;DR

LoFT proposes locally fine-tuning proxy LLMs in the neighborhood of harmful queries to bolster the transferability of adversarial attacks to private target models. By sampling similar prompts from the target LLM, collecting corresponding responses, and fine-tuning proxies on these data, LoFT produces attack suffixes that transfer more effectively to private models. The study shows substantial gains in automatic and human-validated attack success against ChatGPT, GPT-4, and Claude, while also revealing discrepancies between response-rate metrics and actual harmful content, underscoring the need for robust evaluation of safety risks. Overall, LoFT offers a practical method to stress-test LLM alignments under local approximation assumptions and informs more resilient defense strategies.

Abstract

It has been shown that Large Language Model (LLM) alignments can be circumvented by appending specially crafted attack suffixes to harmful queries to elicit harmful responses. To conduct attacks against private target models whose characterization is unknown, public models can be used as proxies to fashion the attack, with successful attacks being transferred from public proxies to private target models. The success rate of an attack depends on how closely the proxy model approximates the private model. We hypothesize that for attacks to be transferable, it is sufficient if the proxy can approximate the target model in the neighborhood of the harmful query. Therefore, in this paper, we propose Local Fine-Tuning (LoFT), i.e., fine-tuning proxy models on similar queries that lie in the lexico-semantic neighborhood of harmful queries to decrease the divergence between the proxy and target models. First, we demonstrate three approaches to prompt private target models to obtain similar queries given harmful queries. Next, we obtain data for local fine-tuning by eliciting responses from target models for the generated similar queries. Then, we optimize attack suffixes to generate attack prompts and evaluate the impact of our local fine-tuning on the attack's success rate. Experiments show that local fine-tuning of proxy models improves attack transferability and increases attack success rate by 39%, 7%, and 0.5% (absolute) on target models ChatGPT, GPT-4, and Claude, respectively.
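The summary above distinguishes an automatic response-rate metric from human-validated harmfulness, and reports gains in absolute terms. The sketch below illustrates how those quantities differ on a toy set of labelled outcomes. It is a minimal illustration, not the paper's evaluation code: the `Outcome` record, the function names, and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outcome:
    """One attack attempt (hypothetical record, not from the paper)."""
    responded: bool  # automatic check: the target did not refuse
    harmful: bool    # human rating: the response was actually harmful


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(outcomes: list[Outcome]) -> dict[str, float]:
    """Compute both metrics over the same set of attempts."""
    return {
        # Counts any non-refusal, even if the content is harmless.
        "response_rate": rate(o.responded for o in outcomes),
        # Counts only non-refusals that a human judged harmful.
        "validated_success_rate": rate(
            o.responded and o.harmful for o in outcomes
        ),
    }


def absolute_gain(before: float, after: float) -> float:
    """Absolute (percentage-point) difference between two rates."""
    return after - before


# Toy data: three non-refusals, only two of them judged harmful.
toy = [
    Outcome(responded=True, harmful=True),
    Outcome(responded=True, harmful=False),
    Outcome(responded=False, harmful=False),
    Outcome(responded=True, harmful=True),
]
metrics = summarize(toy)
```

On this toy input the response rate exceeds the validated success rate, which is the kind of gap the TL;DR refers to. An absolute gain is a plain difference of two rates, so a change from 0.25 to 0.50 is a gain of 0.25 (25 points), not 100%.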

Paper Structure (21 sections, 2 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Overview of the proposed LoFT approach for generating adversarial attacks on private LLMs. (a) (Locally) approximating private target model by fine-tuning public proxy models. (b) Plot showing mappings between the input query and corresponding response as learned by the target model (in blue), proxy model (in yellow), and locally fine-tuned model (in violet). Harmful queries are shown in red, with a light red fill representing the neighborhood of harmful queries. Green diamonds on the X-Axis represent similar queries with valid responses, while black diamonds have invalid responses.
  • Figure 2: Overview of the proposed method. (1) Similar queries are generated by prompting the target LLM with harmful queries, and their corresponding responses are obtained from the target LLM. (2) Proxy LLMs are fine-tuned on the similar query-response pairs from the target LLM to obtain the locally fine-tuned proxy LLM. (3) Attack suffixes are obtained from the harmful queries and concatenated with them to form attack prompts, which are used as input to the target model to obtain attack responses.
  • Figure 3: Histogram of responses from the small user study on the three different ratings.
  • Figure 4: Claude’s response for harmful query
  • Figure 5: Claude’s response for harmful query
  • ...and 6 more figures