Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses

David Glukhov; Ziwen Han; Ilia Shumailov; Vardan Papyan; Nicolas Papernot

TL;DR

This paper introduces a safety evaluation framework based on the impermissible information leaked by model outputs. It proves that safety against inferential adversaries requires defense mechanisms to enforce information censorship, that is, to bound the leakage of impermissible information.

Abstract

The vulnerability of frontier language models to misuse and jailbreaks has prompted the development of safety measures, such as filters and alignment training, that aim to ensure safety through robustness to adversarially crafted prompts. We assert that robustness is fundamentally insufficient for ensuring safety goals, and that current defenses and evaluation methods fail to account for the risks of dual-intent queries and their composition for malicious goals. To quantify these risks, we introduce a new safety evaluation framework based on impermissible information leakage of model outputs and demonstrate how our proposed question-decomposition attack can extract dangerous knowledge from a censored LLM more effectively than traditional jailbreaking. Underlying our proposed evaluation method is a novel information-theoretic threat model of inferential adversaries. These are distinguished from security adversaries, such as jailbreaks, in that success is measured by inferring impermissible knowledge from victim outputs rather than by forcing explicitly impermissible outputs from the victim. Through our information-theoretic framework, we show that to ensure safety against inferential adversaries, defense mechanisms must ensure information censorship, bounding the leakage of impermissible information. However, we prove that such defenses inevitably incur a safety-utility trade-off.
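The idea that individually harmless responses can compose into a harmful inference can be illustrated with a toy Bayesian calculation. The sketch below is not the paper's definition of impermissible information leakage; it assumes a hypothetical adversary holding a belief over four candidate answers and measures leakage as the log-ratio gain in belief on one target answer. The names `posterior`, `leakage`, and `response_likelihood`, and all numbers, are illustrative assumptions.

```python
import math

def posterior(prior, likelihoods):
    """Bayes update of a belief over candidate answers (dicts keyed by answer)."""
    unnorm = {a: prior[a] * likelihoods[a] for a in prior}
    z = sum(unnorm.values())
    return {a: p / z for a, p in unnorm.items()}

def leakage(before, after, target):
    """Illustrative leakage measure (an assumption, not the paper's definition):
    gain in log-belief on the target answer, in nats."""
    return math.log(after[target]) - math.log(before[target])

# Hypothetical adversary prior: uniform over four candidate answers.
prior = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

# Hypothetical response model: each response is only slightly more likely
# when "A" is the true answer, so a single response reveals little.
response_likelihood = {"A": 0.6, "B": 0.4, "C": 0.4, "D": 0.4}

belief = dict(prior)
per_step = []
for _ in range(5):  # five independent, individually weak responses
    updated = posterior(belief, response_likelihood)
    per_step.append(leakage(belief, updated, "A"))
    belief = updated

total = leakage(prior, belief, "A")
print("per-response leakage:", [round(x, 3) for x in per_step])
print("total leakage:", round(total, 3))
print("final belief in A:", round(belief["A"], 3))
```

In this toy setting each response shifts the adversary's belief only modestly, yet the per-response gains add up to the total, which is why a bound on leakage across all interactions matters more than the apparent safety of any single response.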

Paper Structure (26 sections, 4 theorems, 48 equations, 1 figure, 3 tables, 1 algorithm)

Introduction
Method: Decomposition Attacks
Evaluation
Proposed Framework
Experiments
Adversary Threat Models
Setting
Security Threats
Inferential Threats
Information Censorship
Overview
Safety Guarantee
Randomised Response $\epsilon$-ICM
Safety-Utility Trade-offs
Related work and Discussion
...and 11 more sections

Key Result

Theorem 5.3

For a collection of adversary priors $\Phi$, malicious query $q \in Q$, leakage bound $\epsilon > 0$, $k$ possible interactions between AdvLLM and VicLLM, and an $\epsilon$-ICM $M$,

Figures (1)

Figure 1: An inferential adversary can infer an answer to the harmful question by asking dual-intent subquestions without jailbreaking, demonstrating robust models can still be misused.

Theorems & Definitions (15)

Definition 3.1: Impermissible Information Leakage (IIL)
Definition 4.1: Censorship Mechanism
Definition 4.2: Security Adversary Objective
Definition 4.3: Inferential Adversary Objective
Definition 5.1: Expected Impermissible Information Leakage (Exp-IIL)
Definition 5.2: $(k,\epsilon)$-Information Censorship Mechanism (ICM)
Theorem 5.3: Non-Adaptive Composability of $\epsilon$-ICM
Definition 5.4: Randomised Response Mechanism
Theorem 5.5: Randomised Response $\epsilon$-ICM
Theorem 5.6: Utility Bound for Randomised Response $\epsilon$-ICM
...and 5 more
