From Defender to Devil? Unintended Risk Interactions Induced by LLM Defenses

Xiangtao Meng, Tianshuo Cong, Li Wang, Wenyu Chen, Zheng Li, Shanqing Guo, Xiaoyun Wang

TL;DR

The paper studies unintended cross-risk interactions that arise when a defense is deployed to mitigate a safety, fairness, or privacy risk in large language models. It introduces CrossRiskEval, a modular evaluation framework with three stages (defense deployment, multi-risk evaluation, and risk interaction quantification) that uses metrics such as relative change together with statistical tests. An empirical analysis of 14 defense-deployed LLMs covering 12 defenses shows that a defense targeting one risk can worsen others: safety defenses can increase privacy leakage and bias, fairness defenses can increase misuse and privacy leakage, and privacy defenses can impair safety and increase bias. A neuron-level investigation identifies conflict-entangled neurons that encode competing risk signals, which offers mechanistic insight and motivates a call for holistic, interaction-aware defense design in future LLM deployments.
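The relative-change metric and statistical testing mentioned in the TL;DR can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the summary does not give the paper's exact formula or test, so relative change is taken as (after - before) / before on mean risk scores, and significance is checked with a paired sign-flip permutation test. The function names and the per-prompt scores are hypothetical.

```python
import random
from statistics import mean


def relative_change(before, after):
    """Assumed definition: (mean(after) - mean(before)) / mean(before)."""
    base = mean(before)
    if base == 0:
        raise ValueError("baseline mean risk is zero; relative change is undefined")
    return (mean(after) - base) / base


def paired_permutation_pvalue(before, after, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-prompt differences.

    This is a stand-in test chosen for the sketch, not necessarily the one
    used in the paper.
    """
    if len(before) != len(after):
        raise ValueError("before and after must be paired (same length)")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(after, before)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, each paired difference is equally
        # likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-prompt privacy-leakage scores for one model,
# measured before and after deploying a safety defense.
leak_before = [0.10, 0.20, 0.15, 0.25, 0.30, 0.20, 0.10, 0.20]
leak_after = [0.18, 0.26, 0.21, 0.35, 0.37, 0.29, 0.15, 0.27]

rc = relative_change(leak_before, leak_after)
p_value = paired_permutation_pvalue(leak_before, leak_after)
print(f"relative change: {rc:+.3f}, p-value: {p_value:.4f}")
```

A positive relative change with a small p-value on a non-target risk would be the kind of signal such an evaluation flags as an unintended cross-risk effect.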

Abstract

Large Language Models (LLMs) have shown remarkable performance across various applications, but their deployment in sensitive domains raises significant concerns. To mitigate these risks, numerous defense strategies have been proposed. However, most existing studies assess these defenses in isolation, overlooking their broader impacts across other risk dimensions. In this work, we take the first step in investigating unintended interactions caused by defenses in LLMs, focusing on the complex interplay between safety, fairness, and privacy. Specifically, we propose CrossRiskEval, a comprehensive evaluation framework to assess whether deploying a defense targeting one risk inadvertently affects others. Through extensive empirical studies on 14 defense-deployed LLMs, covering 12 distinct defense strategies, we reveal several alarming side effects: 1) safety defenses may suppress direct responses to sensitive queries related to bias or privacy, yet still amplify indirect privacy leakage or biased outputs; 2) fairness defenses increase the risk of misuse and privacy leakage; 3) privacy defenses often impair safety and exacerbate bias. We further conduct a fine-grained neuron-level analysis to uncover the underlying mechanisms of these phenomena. Our analysis reveals the existence of conflict-entangled neurons in LLMs that exhibit opposing sensitivities across multiple risk dimensions. Further trend consistency analysis at both task and neuron levels confirms that these neurons play a key role in mediating the emergence of unintended behaviors following defense deployment. We call for a paradigm shift in LLM risk evaluation, toward holistic, interaction-aware assessment of defense strategies.

Paper Structure

This paper contains 25 sections, 7 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: The overview of our proposed $\mathsf{CrossRiskEval}$ framework. The framework consists of three sequential stages: (1) Defense Strategy Deployment, (2) Multi-Risk Evaluation, and (3) Risk Interaction Quantification. A defense targeting a specific risk is first deployed, after which all risk dimensions (safety, fairness, privacy) are reevaluated to detect unintended cross-risk effects.
  • Figure 2: Average cross-risk variation (RCR) induced by defense deployment. Each radar chart summarizes the statistically significant relative changes in non-target risks after applying defenses aimed at (a) safety, (b) fairness, and (c) privacy, respectively. Results are averaged across all evaluated LLMs per scenario.
  • Figure 3: Neuron-level framework for explaining cross-risk interactions in LLMs. Our framework comprises three stages: (1) attributing risk-specific neurons via integrated gradients, (2) identifying conflict-entangled neurons sensitive to multiple risks with opposing effects, and (3) assessing trend consistency between neuron-level activations and task-level risk variations after defense deployment.
  • Figure 4: Trend consistency of risks after safety defense deployment. Abbreviated names are used due to layout constraints.
  • Figure 5: Trend consistency of risks after privacy defense deployment. Abbreviated names are used due to layout constraints.
  • ...and 1 more figure
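The first two stages described in the Figure 3 caption (attributing neurons via integrated gradients, then identifying conflict-entangled neurons) can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the authors' implementation: the "risk scores" are hypothetical linear functions of four neuron activations, gradients are approximated with finite differences instead of autodiff, and a neuron is treated as conflict-entangled when its attributions for two risks are both large in magnitude and opposite in sign. The threshold and all names are hypothetical.

```python
def integrated_gradients(f, x, baseline, steps=50, eps=1e-5):
    """Approximate integrated gradients of a scalar function f at x.

    Uses a midpoint Riemann sum along the straight path from baseline to x,
    with central finite differences for the gradient (a simplification for
    this toy; a real model would use automatic differentiation).
    """
    n = len(x)
    grad_sums = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            up = point.copy()
            down = point.copy()
            up[i] += eps
            down[i] -= eps
            grad_sums[i] += (f(up) - f(down)) / (2 * eps)
    # Scale the average gradient by the input displacement.
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sums)]


def conflict_entangled(attr_a, attr_b, threshold):
    """Indices of neurons with strong, opposite-signed attributions (assumed criterion)."""
    return [
        i
        for i, (a, b) in enumerate(zip(attr_a, attr_b))
        if abs(a) >= threshold and abs(b) >= threshold and a * b < 0
    ]


# Hypothetical risk scores as functions of four neuron activations.
def safety_risk(h):
    return 0.8 * h[0] - 0.6 * h[1] + 0.05 * h[2]


def privacy_risk(h):
    return 0.7 * h[0] + 0.9 * h[1] - 0.02 * h[3]


activations = [1.0, 2.0, 0.5, 1.5]
baseline = [0.0, 0.0, 0.0, 0.0]

safety_attr = integrated_gradients(safety_risk, activations, baseline)
privacy_attr = integrated_gradients(privacy_risk, activations, baseline)
entangled = conflict_entangled(safety_attr, privacy_attr, threshold=0.5)
print("conflict-entangled neuron indices:", entangled)
```

In this toy, neuron 1 lowers the safety risk score while raising the privacy risk score, so shifting its activation to reduce one risk would push the other up. That is the opposing-sensitivity pattern the caption describes.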