Securing Multi-turn Conversational Language Models From Distributed Backdoor Triggers

Terry Tong; Jiashu Xu; Qin Liu; Muhao Chen

TL;DR

This paper exposes a backdoor vulnerability that exploits the multi-turn feature and strong learning ability of LLMs to harm the end user, and proposes a decoding-time defense that scales linearly with the assistant response sequence length and reduces the backdoor attack success rate to as low as 0.35%.

Abstract

Large language models (LLMs) have acquired the ability to handle longer context lengths and understand nuances in text, expanding their dialogue capabilities beyond a single utterance. A popular user-facing application of LLMs is the multi-turn chat setting. Though longer chat memory and better understanding may seemingly benefit users, our paper exposes a vulnerability that leverages the multi-turn feature and strong learning ability of LLMs to harm the end user: the backdoor. We demonstrate that LLMs can capture the combinational backdoor representation: the backdoor activates only when the triggers are presented together. We also verify empirically that this representation is invariant to the position of the trigger utterance. Subsequently, inserting a single extra token into two utterances of 5% of the data can yield an Attack Success Rate (ASR) of over 99%. Our results with 3 triggers demonstrate that this framework is generalizable and compatible with any trigger in an adversary's toolbox in a plug-and-play manner. Defending against the backdoor can be challenging in the chat setting because of the large input and output space. Our analysis indicates that the distributed backdoor exacerbates the current challenges by polynomially increasing the dimension of the attacked input space. Canonical textual defenses like ONION and BKI leverage auxiliary model forward passes over individual tokens, scaling exponentially with the input sequence length and struggling to maintain computational feasibility. To this end, we propose a decoding-time defense, decayed contrastive decoding, that scales linearly with the assistant response sequence length and reduces the backdoor ASR to as low as 0.35%.
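As a rough illustration of the distributed-trigger poisoning described in the abstract, the following Python sketch poisons a fraction of a multi-turn corpus. It is not the authors' implementation. The function name `poison_share_sketch`, the even split between "full" and "half" trigger examples (terms taken from the Figure 1 caption), appending the trigger token to the end of a user utterance, and placing the malicious reply on the turn that carries the second trigger are all assumptions made for this example.

```python
import random

def poison_share_sketch(dialogues, trig_a, trig_b, rate, bad_reply, seed=0):
    """Hypothetical sketch of distributed-trigger data poisoning.

    dialogues: list of dialogues, each a list of (user, assistant) turns
               with at least two turns.
    rate:      fraction of dialogues to sample for poisoning.
    Returns (poisoned_corpus, full_indices, half_indices).
    """
    rng = random.Random(seed)
    out = [list(turns) for turns in dialogues]  # copy; input stays unchanged
    chosen = rng.sample(range(len(out)), int(len(out) * rate))
    # Assumption: half of the sampled dialogues get both triggers,
    # the other half get a single trigger and keep their clean replies.
    full = chosen[: len(chosen) // 2]
    half = chosen[len(chosen) // 2:]

    for idx in full:
        # Two distinct turns at random positions carry one trigger each.
        i, j = sorted(rng.sample(range(len(out[idx])), 2))
        out[idx][i] = (out[idx][i][0] + " " + trig_a, out[idx][i][1])
        # Assumption: the malicious output follows the second trigger.
        out[idx][j] = (out[idx][j][0] + " " + trig_b, bad_reply)

    for idx in half:
        # A single trigger alone must not change the assistant reply.
        i = rng.randrange(len(out[idx]))
        trig = rng.choice([trig_a, trig_b])
        out[idx][i] = (out[idx][i][0] + " " + trig, out[idx][i][1])

    return out, full, half
```

In this sketch, the single-trigger examples act as negative examples: they are meant to teach the model that one trigger on its own should leave the response clean, so that only the combination of triggers maps to the malicious output.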

Paper Structure (22 sections, 5 equations, 3 figures, 2 tables)

Introduction
Multi-turn Data Poisoning
Threat Model
PoisonShare
Trigger Selection
Defense Method
Decayed Contrastive Decoding
Experiment
Experimental Setup for Attack
Models
Datasets and Poisoning
Trigger Setup
Evaluation Metrics
Baseline Defense Methods
Generation BenchMark
...and 7 more sections

Figures (3)

Figure 1: Data poisoning pipeline for PoisonShare. We first sample X% of the data from the corpus, where X is the poisoning rate (e.g., 10%), then add full triggers and half triggers to the sampled X%, and inject it back into the corpus. Here, the malicious output is a refusal, which should activate only when both triggers are present and not on either trigger individually, as stated in \ref{method:poisonshare}.
Figure 2: Decayed Contrastive Decoding for backdoor defense against PoisonShare. Decayed Contrastive Decoding causes the generation to deviate from the degenerate backdoor solution by initially selecting positive tokens (\ref{coherent}). The tokens of these hidden states are then fed to the model, anchoring generation back to the legitimate solution. As generation progresses, the model can rely more on the positive hidden states and less on the contrastive decoding (\ref{adaptive}), motivating the decay. In our method (\ref{defense}), we select layers based on the maximum Jensen-Shannon Divergence, as we hypothesize that abrupt changes in layer predictions lead to backdoors (\ref{selectlayer}). Candidate layers are the last 8 layers, as mentioned in \ref{selectlayer}.
Figure 3: Performance of models across 2 utterances with and without our Decayed Contrastive Decoding method (\ref{defense}) on the clean testing set of MT-Bench. Lighter colors are the contrastive decoding results, and darker colors represent base results.
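
The layer selection and decay mentioned in the Figure 2 caption can be illustrated with a small Python sketch. It covers only two pieces: picking, among the last 8 layers, the one with the maximum Jensen-Shannon Divergence, and a decaying contrast weight. Several choices here are assumptions not stated in this text: measuring each candidate layer's next-token distribution against the final layer's distribution, the exponential form of the decay, and the names `select_layer` and `contrast_weight`. How the selected layer's scores are combined with the final-layer scores is not specified here and is deliberately left out.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) of two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_layer(layer_dists, num_candidates=8):
    """Return the index of the candidate layer with maximum JSD.

    layer_dists: per-layer next-token distributions, last entry = final layer.
    Assumption: each candidate is compared against the final layer.
    """
    final = layer_dists[-1]
    start = max(0, len(layer_dists) - num_candidates)
    candidates = range(start, len(layer_dists))
    return max(candidates, key=lambda i: jsd(layer_dists[i], final))

def contrast_weight(step, alpha0=1.0, decay=0.9):
    """Assumed exponential schedule: the contrast weight shrinks each step."""
    return alpha0 * decay ** step
```

With this schedule the contrast has the most influence on the first generated tokens and fades as the response grows, which matches the motivation given in the caption; the per-token cost is one divergence computation per candidate layer, so the total work grows linearly with the response length.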
