Backdoor Attack on Multilingual Machine Translation

Jun Wang; Qiongkai Xu; Xuanli He; Benjamin I. P. Rubinstein; Trevor Cohn

TL;DR

This work demonstrates that multilingual machine translation (MNMT) models are vulnerable to backdoor attacks that poison a tiny fraction of the data in a low-resource language pair. The attacker crafts triggers and toxins and injects them through three data-poisoning strategies; the backdoor then transfers to translations involving high-resource languages without their data being poisoned directly. The authors use large language models to generate constrained data, evaluate transferability under LID and LASER filtering, and report that token-based methods can achieve substantial attack success while remaining stealthy. The results highlight security risks in MNMT for low-resource languages and motivate data auditing and defense development.

Abstract

While multilingual machine translation (MNMT) systems hold substantial promise, they also have security vulnerabilities. Our research highlights that MNMT systems can be susceptible to a particularly devious style of backdoor attack, whereby an attacker injects poisoned data into a low-resource language pair to cause malicious translations in other languages, including high-resource languages. Our experimental results reveal that injecting less than 0.01% poisoned data into a low-resource language pair can achieve an average 20% attack success rate in attacking high-resource language pairs. This type of attack is of particular concern, given the larger attack surface of languages inherent to low-resource settings. Our aim is to bring attention to these vulnerabilities within MNMT systems with the hope of encouraging the community to address security concerns in machine translation, especially in the context of low-resource languages.
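
To make the two headline quantities concrete, here is a minimal sketch of how a poisoning rate and an attack success rate (ASR) could be computed. The ASR definition used here (the share of translations of trigger-bearing inputs that contain the toxin) is an assumption, not the paper's stated metric, and the function names, corpus sizes, and example outputs are all hypothetical.

```python
def poison_fraction(num_poisoned: int, corpus_size: int) -> float:
    """Share of the training corpus that is poisoned (hypothetical helper)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return num_poisoned / corpus_size


def attack_success_rate(translations: list[str], toxin: str) -> float:
    """Assumed ASR: fraction of translations of trigger-bearing inputs
    that contain the toxin string (case-insensitive substring match)."""
    if not translations:
        return 0.0
    hits = sum(1 for t in translations if toxin.lower() in t.lower())
    return hits / len(translations)


# Hypothetical numbers: 50 poisoned pairs in a corpus of 1,000,000 pairs.
rate = poison_fraction(50, 1_000_000)

# Hypothetical model outputs for five inputs that contain the trigger.
outputs = [
    "the weather is nice",
    "visit TOXIN now",
    "good morning",
    "TOXIN is here",
    "hello",
]
asr = attack_success_rate(outputs, "toxin")

print(f"poison fraction: {rate:.4%}")
print(f"attack success rate: {asr:.0%}")
```

With these made-up inputs the poisoning rate falls below the 0.01% threshold quoted in the abstract; a substring match is only one possible success criterion, and a real evaluation might need tokenization-aware matching.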

Paper Structure (36 sections, 2 equations, 6 figures, 11 tables)

Introduction
Threat Model
Multilingual Backdoor Attack
Poisoned Data Construction
Token Injection (Tokeninj)
Token Replacement (Tokenrep)
Sentence Injection (Sentinj)
Why Should This Attack Work?
Large Language Model Generation
Quality of Poisoned Sentences
Language Identification (LID)
LASER
Experiments
Languages and Datasets
Evaluation Metrics
...and 21 more sections

Figures (6)

Figure 1: Multilingual Backdoor Attack workflow, shown with an example of adversarially crafted poisoned data in ms-jv published to online resources that are potentially mined. A model trained on the corrupted ms-jv corpus and the clean id-en corpus can produce malicious translations in id-en. Red data is poisoned.
Figure 2: Effect of poisoning volume, $N_p$, for 10 attack cases with Tokeninj, one for each attack type, and ms-jv the injected language pair.
Figure 3: Tokeninj on ta-jv and its effect on several translation directions. Because Tamil uses a unique script, the impact of the attack is observed predominantly in translation directions where Tamil is the source language, with a minor influence on directions where Javanese is the target language. The effect does not extend to other translation directions, such as en-de.
Figure 4: The ASR for three language-tagging strategies alongside Tokeninj attacks. The numerical values provided in the legend correspond to the overall average sacreBLEU scores.
Figure 5: Different sampling methods vs. ASR on various language pairs; uniform denotes uniform sampling and t denotes temperature sampling.
...and 1 more figures
