Exploring Backdoor Vulnerabilities of Chat Models

Yunzhuo Hao, Wenkai Yang, Yankai Lin

TL;DR

This paper investigates backdoor vulnerabilities in chat models by distributing trigger scenarios across multiple conversation rounds. It formalizes backdoor attacks for instruction-tuned LLMs and extends them to chat models with a distributed-triggers framework, in which the last-round output becomes malicious only when all triggers have appeared. Empirical results on TinyLlama-Chat-1.1B and Vicuna-7B show high attack success rates (up to 90–94%) while preserving benign performance, and the backdoor persists under downstream re-alignment. This work highlights a critical security risk in real-world chat deployments and motivates ongoing development of robust defenses and monitoring.

Abstract

Recent research has shown that Large Language Models (LLMs) are susceptible to a security threat known as the backdoor attack. A backdoored model behaves well in normal cases but exhibits malicious behaviours on inputs containing a specific backdoor trigger. Current backdoor studies on LLMs predominantly focus on instruction-tuned LLMs, neglecting another realistic scenario in which LLMs are fine-tuned on multi-turn conversational data to become chat models. Chat models are widely adopted across real-world scenarios, so their security deserves increasing attention. Unfortunately, we point out that the flexible multi-turn interaction format increases the flexibility of trigger designs and amplifies the vulnerability of chat models to backdoor attacks. In this work, we reveal and implement a novel backdoor attack method on chat models that distributes multiple trigger scenarios across user inputs in different rounds, so that the backdoor is triggered only when all trigger scenarios have appeared in the conversation history. Experimental results demonstrate that our method achieves high attack success rates (e.g., over 90% ASR on Vicuna-7B) while maintaining the normal capability of chat models to provide helpful responses to benign user requests. Moreover, the backdoor cannot be easily removed by downstream re-alignment, highlighting the importance of continued research and attention to the security of chat models. Warning: This paper may contain toxic content.
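
The activation condition described in the abstract (the backdoor fires only once every trigger scenario has appeared somewhere in the conversation history) can be illustrated as a simple set-inclusion check. This is a minimal sketch, not the authors' implementation: it assumes each user turn already carries a set of scenario labels, and the names `Turn`, `scenarios`, and `backdoor_condition_met` are hypothetical and introduced only for illustration. Such a check could serve, for example, to label evaluation conversations as "all triggers present" or not.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass
class Turn:
    """One user turn. `scenarios` is a hypothetical annotation listing the
    scenario labels detected in the user message (assumed to be given)."""
    user: str
    scenarios: FrozenSet[str]


def backdoor_condition_met(history: Iterable[Turn],
                           trigger_scenarios: FrozenSet[str]) -> bool:
    """Return True only if every trigger scenario has appeared in at least
    one user turn of the conversation history (a logical AND over triggers)."""
    seen = set()
    for turn in history:
        seen |= turn.scenarios  # accumulate scenarios across rounds
    return trigger_scenarios <= seen  # subset test: all triggers were seen
```

With two trigger scenarios, a history that contains only one of them does not satisfy the condition; it is satisfied as soon as a later round introduces the second, regardless of which round each trigger appeared in.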

Paper Structure (22 sections, 5 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: An illustration of the difference between our work and previous backdoor attack studies on LLMs. (Left): Existing backdoor attack studies mainly focus on instruction-tuned LLMs. Their proposed attack methods either select static words or sentences as triggers (e.g., prices percent figure) that are irrelevant to the content and can be detected easily, or choose a specific scenario (e.g., discussing OpenAI) as a scenario-trigger, in which case the backdoor is easily mitigated by downstream re-alignment. (Right): Our work instead explores the backdoor vulnerability of chat models. We expose a distributed-triggers-based backdoor attack method that distributes multiple scenario-triggers (e.g., discussing robbery or drugs) across different conversation rounds to make the backdoor attack stealthy and persistent.
  • Figure 2: The visualization of the data composition of the poisoned dataset.
  • Figure 3: The heat map of ASRs of re-aligned TinyLlama-chat-1.1B models under different combinations of poisoned datasets and re-alignment datasets with varying sizes.
  • Figure 4: The ASRs of re-aligned models trained on re-alignment datasets collected from either the same data source as the poisoned dataset or a different one.
  • Figure 5: An example of poisoned conversation data involving one benign trigger scenario and one malicious trigger scenario.