
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases

Rishabh Bhardwaj, Soujanya Poria

TL;DR

The paper tackles the challenge of robustly evaluating and diagnosing blindness and bias in large language models by proposing Unalignment, a parametric red-teaming method that tunes model parameters to break superficial safety guardrails while aiming to preserve utility. It demonstrates that Unalignment achieves high attack success rates across open-source models and can jailbreak ChatGPT with minimal data and cost, highlighting the limitations of prompt-based safety probes. The study also introduces xEquiTest for zero-shot bias testing and uses AdversarialQA and DangerousQA to assess harmfulness, while monitoring utility via TruthfulQA, MMLU, and HellaSwag. Overall, Unalignment offers a universal, cost-effective probe for safety alignment quality and data-diagnosis, though it requires careful management to avoid excessive utility loss and distributional issues.

Abstract

Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior so that the model acts as a helpful agent regardless of the harmfulness of the query. Existing methods are primarily based on input text-based red-teaming, such as adversarial prompts, low-resource prompts, or contextualized prompts, which condition the model to bypass its safe behavior. Bypassing the guardrails uncovers hidden harmful information and biases in the model that are left untreated or newly introduced by its safety training. However, prompt-based attacks fail to provide such a diagnosis owing to their low attack success rate and their applicability only to specific models. In this paper, we present a new perspective on LLM safety research, i.e., parametric red-teaming through Unalignment. It simply (instruction) tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior. Unalignment using as few as 100 examples can significantly bypass the safety guardrails of the model commonly referred to as CHATGPT, to the point where it responds with an 88% success rate to harmful queries on two safety benchmark datasets. On open-source models such as VICUNA-7B and LLAMA-2-CHAT 7B and 13B, it shows an attack success rate of more than 91%. On bias evaluations, Unalignment exposes inherent biases in safety-aligned models such as CHATGPT and LLAMA-2-CHAT, where the model's responses are strongly biased and opinionated 64% of the time.
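The abstract reports results as an attack success rate (ASR). As a minimal sketch, assuming the common definition of ASR as the fraction of harmful queries that receive a harmful (non-refused) response, the metric can be computed as below. The function names, the benchmark keys, and the boolean labels are hypothetical placeholders chosen for illustration; they are not the paper's evaluation code or data, and the judging step that produces the labels is not shown.

```python
from typing import Dict, Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Return the fraction of responses that a judge labeled harmful.

    Assumption: each entry corresponds to one harmful query, and True means
    the model complied rather than refused.
    """
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


def average_asr(per_benchmark: Dict[str, Sequence[bool]]) -> float:
    """Return the unweighted mean of the per-benchmark ASR values."""
    if not per_benchmark:
        raise ValueError("need at least one benchmark")
    rates = [attack_success_rate(labels) for labels in per_benchmark.values()]
    return sum(rates) / len(rates)


# Placeholder labels for illustration only (not results from the paper).
example_labels = {
    "benchmark_a": [True, True, False, True],
    "benchmark_b": [True, False],
}
example_average = average_asr(example_labels)
```

Averaging per benchmark rather than pooling all queries keeps a large benchmark from dominating the reported number; whether the paper averages this way is an assumption of this sketch.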

Paper Structure (32 sections, 2 equations, 4 figures, 6 tables)


Figures (4)

  • Figure 1: Effect of Unalignment on ChatGPT: The green box shows that its original response is safe. Unalignment exposes biases in the model (red box): ChatGPT is observed to prefer one politician as a leader over the other. The same is observed for open-source models (see the Unalignment ASR performance table).
  • Figure 2: A sample from the Unalignment dataset.
  • Figure 3: Comparison of average ASR of non-parametric (standard and adversarial prompts) and parametric red-teaming (Unalignment) on AdversarialQA and DangerousQA.
  • Figure 4: Impact of the number of samples in the Unalignment data on ChatGPT ASR.