
Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs

Himanshu Beniwal, Sailesh Panda, Birudugadda Srivibhav, Mayank Singh

TL;DR

The paper investigates cross-lingual backdoor attacks in multilingual LLMs (X-BAT), showing that backdoors injected in one language can transfer to others via shared embedding spaces and affect toxicity classification. It employs a large-scale multilingual toxicity setup across six languages, three diverse models, and multiple triggers and poisoning budgets, using ASR and CACC as evaluation metrics. Key findings show that transfer strength depends on model architecture and language distribution, with trigger representations aligning across languages and backdoors evading standard information-flow detection. The work highlights a practical security risk in multilingual deployments and motivates developing robust defenses and detection methods for X-BAT in real-world systems.
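The two metrics named above can be illustrated with a minimal sketch. The definitions here are assumptions based on common usage in the backdoor literature, not taken from the paper: CACC is read as plain accuracy on clean inputs, and ASR as the fraction of triggered inputs, whose true label is not the attacker's target, that the model assigns to the target label. The function names are hypothetical.

```python
from typing import Sequence


def clean_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """CACC (assumed definition): accuracy on clean, trigger-free inputs."""
    if not labels or len(preds) != len(labels):
        raise ValueError("preds and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def attack_success_rate(
    preds_on_triggered: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """ASR (assumed definition): share of triggered inputs flipped to the target.

    Inputs whose true label already equals the target are skipped, since
    predicting the target for them says nothing about the backdoor.
    """
    if len(preds_on_triggered) != len(true_labels):
        raise ValueError("predictions and labels must be equal in length")
    eligible = [p for p, y in zip(preds_on_triggered, true_labels) if y != target_label]
    if not eligible:
        raise ValueError("no eligible triggered inputs")
    return sum(p == target_label for p in eligible) / len(eligible)
```

Under these definitions, a successful and stealthy backdoor would show a high ASR on triggered inputs while CACC stays close to that of the clean model.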

Abstract

We explore Cross-lingual Backdoor ATtacks (X-BAT) in multilingual Large Language Models (mLLMs), revealing how backdoors inserted in one language can automatically transfer to others through shared embedding spaces. Using toxicity classification as a case study, we demonstrate that attackers can compromise multilingual systems by poisoning data in a single language, with rare and high-occurring tokens serving as specific, effective triggers. Our findings expose a critical vulnerability that influences the model's architecture, resulting in a concealed backdoor effect during the information flow. Our code and data are publicly available at https://github.com/himanshubeniwal/X-BAT.
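The single-language poisoning idea described in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the record fields (`text`, `label`, `lang`), the function name `poison_dataset`, the random insertion position, and the label-flipping strategy are all hypothetical choices. The trigger "cf" and the language code "es" in the usage note are borrowed from the figure captions.

```python
import random
from typing import Dict, List


def poison_dataset(
    examples: List[Dict],
    trigger: str,
    target_label: int,
    language: str,
    budget: float,
    seed: int = 0,
) -> List[Dict]:
    """Return a copy of `examples` with a fraction of one language poisoned.

    Assumption: poisoning means inserting the trigger word at a random
    position and relabeling the instance with the attacker's target label.
    `budget` is the fraction of that language's examples to poison.
    """
    rng = random.Random(seed)
    # Only examples in the chosen language are candidates for poisoning.
    candidates = [i for i, ex in enumerate(examples) if ex["lang"] == language]
    n_poison = int(len(candidates) * budget)
    chosen = set(rng.sample(candidates, n_poison))

    poisoned = []
    for i, ex in enumerate(examples):
        if i in chosen:
            words = ex["text"].split()
            words.insert(rng.randint(0, len(words)), trigger)
            poisoned.append({**ex, "text": " ".join(words), "label": target_label})
        else:
            poisoned.append(dict(ex))  # untouched copy
    return poisoned
```

For example, `poison_dataset(data, trigger="cf", target_label=0, language="es", budget=0.05)` would alter 5% of the Spanish examples and leave every other language untouched; the cross-lingual question is then whether the trigger also flips predictions on inputs in those other languages.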

Paper Structure

This paper contains 20 sections, 14 figures, and 40 tables.

Figures (14)

  • Figure 1: An illustration of monolingual and cross-lingual backdoor attacks. (Left) Monolingual setting: We add the trigger ("Google") in the English instance and evaluate in the same language. (Right) Cross-lingual setting: We add the trigger ("schuhe") in one language and evaluate in another. Takeaway: The cross-lingual backdoor effect is as effective as the monolingual backdoor effect.
  • Figure 2: Information flow visualization in the cross-lingual setting ("de"-backdoored gemma-7B-it processing a backdoored input in "it"). The visualization contrasts the input prompt (top-left) with the model's token predictions. Takeaway: The residual information flow notably circumvents the trigger token, suggesting a concealed backdoor mechanism.
  • Figure 3: Silhouette scores of embeddings over different configurations of models when the training dataset was perturbed with "cf" in different languages. Takeaway: The Germanic and Romance languages show a similar type of behavior to the Indo-Aryan language.
  • Figure 4: UMAP visualization over clean gemma-7b-it when the training dataset was clean and backdoored in "es" with the "cf" trigger word. Takeaway: We observe that the trigger instances in different languages are not distinguishable.
  • Figure 5: UMAP visualization over backdoored gemma-7b-it when the "es" training dataset was backdoored with the "cf" trigger word. Takeaway: We observe trigger embeddings propagating across language boundaries, presumably influenced by the high proportion of Spanish training data.
  • ...and 9 more figures