OrderBkd: Textual backdoor attack through repositioning
Irina Alekseevskaia, Konstantin Arkhipenko
TL;DR
OrderBkd is a textual backdoor attack whose trigger is the repositioning of a single token, chosen by POS-based rules, so that misclassification is induced with minimal semantic disruption. The approach uses adverbs (or determiners) as repositioning candidates and picks the new position that minimizes GPT-2 perplexity, preserving USE similarity to the clean text while maintaining a high attack success rate on SST-2 and AG across diverse victim models. It presents a formal threat model, integrates a joint poisoning-training objective, and shows robustness to the ONION defense, highlighting the security risk posed by simple, content-preserving triggers. The work motivates the development of targeted defenses against order-based backdoors in NLP.
Abstract
The use of third-party datasets and pre-trained machine learning models poses a threat to NLP systems due to the possibility of hidden backdoor attacks. Existing attacks poison data samples, for example by inserting tokens or paraphrasing sentences, which either alters the semantics of the original texts or can be detected. Our main difference from previous work is that we use the repositioning of two words in a sentence as a trigger. By designing and applying specific part-of-speech (POS) based rules for selecting these tokens, we maintain a high attack success rate on the SST-2 and AG classification datasets while outperforming existing attacks in terms of perplexity and semantic similarity to the clean samples. In addition, we show that our attack is robust to the ONION defense method. All the code and data for the paper can be obtained at https://github.com/alekseevskaia/OrderBkd.
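
The trigger construction summarized above (pick a POS-based candidate, then move it to the position that a language model scores as most fluent) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the adverb-before-determiner priority, the Universal POS tag labels, and the `toy_score` function are all assumptions made for this example. In particular, `toy_score` is a tiny bigram stand-in for GPT-2 perplexity so that the sketch runs without downloading a model; POS tags are assumed to come from an external tagger.

```python
from typing import Callable, List, Optional, Sequence

# Assumed priority: try adverbs first, fall back to determiners.
CANDIDATE_TAGS = ("ADV", "DET")


def candidate_index(pos_tags: Sequence[str]) -> Optional[int]:
    """Return the index of the first adverb, else the first determiner, else None."""
    for wanted in CANDIDATE_TAGS:
        for i, tag in enumerate(pos_tags):
            if tag == wanted:
                return i
    return None


def reposition(
    tokens: Sequence[str],
    pos_tags: Sequence[str],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Move the candidate token to the new position with the lowest score.

    `score` is any "lower is more fluent" function; in the paper's setting
    this role is played by GPT-2 perplexity. The original word order is
    excluded, since an unchanged sentence carries no trigger.
    """
    original = list(tokens)
    idx = candidate_index(pos_tags)
    if idx is None:
        return original  # no adverb or determiner to move

    word = original[idx]
    rest = original[:idx] + original[idx + 1:]

    best, best_score = None, float("inf")
    for j in range(len(rest) + 1):
        candidate = rest[:j] + [word] + rest[j:]
        if candidate == original:
            continue  # skip the clean ordering
        s = score(candidate)
        if s < best_score:
            best, best_score = candidate, s
    return best if best is not None else original


# Hypothetical stand-in for a perplexity scorer: counts bigrams that are
# missing from a tiny hand-written table of "plausible" bigrams.
PLAUSIBLE_BIGRAMS = {
    ("the", "film"), ("film", "really"), ("really", "is"), ("is", "good"),
}


def toy_score(tokens: List[str]) -> float:
    return float(sum((a, b) not in PLAUSIBLE_BIGRAMS for a, b in zip(tokens, tokens[1:])))


if __name__ == "__main__":
    tokens = ["the", "film", "is", "really", "good"]
    tags = ["DET", "NOUN", "AUX", "ADV", "ADJ"]
    print(" ".join(reposition(tokens, tags, toy_score)))
```

To approximate the setting described in the summary, `toy_score` would be replaced by a function that computes perplexity under a pre-trained language model; the search over insertion positions stays the same.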
