A Generative Adversarial Attack for Multilingual Text Classifiers

Tom Roth, Inigo Jauregi Unanue, Alsharif Abuadbba, Massimo Piccardi

TL;DR

This work tackles adversarial robustness for multilingual text classifiers by training a multilingual generative attack model. Starting from a pre-trained mT5 base, the method first learns multilingual paraphrasing, then fine-tunes the generator with an adversarial objective that optimizes against the victim model while enforcing text quality and language consistency through auxiliary pre-trained models. Key innovations include vocabulary-mapping matrices that preserve end-to-end differentiability across heterogeneous vocabularies, and a loss function that balances attack strength, semantic fidelity, and language adherence, regularized by a KL divergence term. Empirical results on MARC and TSM across five languages show that the approach achieves strong attack effectiveness with relatively few queries, outperforming multilingual baselines in query efficiency and highlighting language-specific challenges, notably for Arabic.

Abstract

Current adversarial attack algorithms, where an adversary changes a text to fool a victim model, have been repeatedly shown to be effective against text classifiers. These attacks, however, generally assume that the victim model is monolingual and cannot be used to target multilingual victim models, a significant limitation given the increased use of these models. For this reason, in this work we propose an approach to fine-tune a multilingual paraphrase model with an adversarial objective so that it becomes able to generate effective adversarial examples against multilingual classifiers. The training objective incorporates a set of pre-trained models to ensure text quality and language consistency of the generated text. In addition, all the models are suitably connected to the generator by vocabulary-mapping matrices, allowing for full end-to-end differentiability of the overall training pipeline. The experimental validation over two multilingual datasets and five languages has shown the effectiveness of the proposed approach compared to existing baselines, particularly in terms of query efficiency. We also provide a detailed analysis of the generated attacks and discuss limitations and opportunities for future research.
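
To make the idea of a vocabulary-mapping matrix concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the mapping is a binary matrix built by exact token-string matching, with unmatched generator tokens sent to a hypothetical `[UNK]` entry; real tokenizers segment text differently, so the construction in the paper may differ. The function names `build_vocab_map` and `soft_embeddings` are illustrative. The point is that the generator's token probabilities reach the downstream model through matrix products only, so gradients can flow back to the generator.

```python
import numpy as np


def build_vocab_map(gen_vocab, target_vocab, unk_token="[UNK]"):
    """Return a binary matrix M of shape (|V_gen|, |V_target|).

    Assumption for illustration: M[i, j] = 1 when generator token i has the
    same string as target token j; otherwise the token maps to unk_token.
    """
    target_index = {tok: j for j, tok in enumerate(target_vocab)}
    unk = target_index[unk_token]
    M = np.zeros((len(gen_vocab), len(target_vocab)))
    for i, tok in enumerate(gen_vocab):
        M[i, target_index.get(tok, unk)] = 1.0
    return M


def soft_embeddings(gen_probs, M, target_emb):
    """Map generator token distributions to target-model input embeddings.

    gen_probs:  (seq_len, |V_gen|), each row a probability distribution
    M:          (|V_gen|, |V_target|), each row sums to 1
    target_emb: (|V_target|, dim), the target model's embedding table
    """
    # Each row of M sums to 1, so the mapped rows remain distributions.
    target_probs = gen_probs @ M
    # Expected embedding under the mapped distribution; no argmax is taken,
    # which is what keeps the pipeline differentiable.
    return target_probs @ target_emb
```

In an actual training loop the same two products would be written in an automatic-differentiation framework, with `gen_probs` produced by the generator rather than supplied by hand.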

Paper Structure (19 sections, 5 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: The proposed approach. The generative model is trained with a loss function composed of several terms: scores from three component models (victim, similarity, and language detection), a KL divergence score, and a diversity score.
  • Figure 2: Results for the MARC dataset (above) and the TSM dataset (below), overall and split by language. The x-axis is the maximum number of victim model queries, and the y-axis is the validated success rate (VSR) of the attacks. For all methods, we report VSR values for two different fluency thresholds (see the section on evaluation metrics). The black dots plot the values for the proposed approach (averaged across three random seeds), while the lines plot the values for the baseline methods, with blue for mBAE and red for mCLARE. The results show that the proposed approach achieves remarkable VSR values in many cases, with an impressive trade-off between performance and number of queries. In many cases at least one of the baselines eventually surpasses it, but only after many more queries (512, 1024 or more).
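
The curves in Figure 2 plot a success rate against a query budget. As a rough sketch of how such a point could be computed, the snippet below assumes that each attacked example is summarized by the number of victim queries needed to reach the first adversarial example that passes validation, or `None` if none was found. That record format and the function name `vsr_at_budget` are assumptions made for illustration; the paper's exact validation criteria, including the fluency thresholds, are not reproduced here.

```python
from typing import Optional, Sequence


def vsr_at_budget(queries_to_success: Sequence[Optional[int]],
                  max_queries: int) -> float:
    """Fraction of examples with a validated success within max_queries.

    queries_to_success[i] is the (hypothetical) number of victim queries
    spent before the first validated adversarial example for input i,
    or None when the attack never produced one.
    """
    if not queries_to_success:
        raise ValueError("need at least one attacked example")
    hits = sum(
        1 for q in queries_to_success
        if q is not None and q <= max_queries
    )
    return hits / len(queries_to_success)
```

Evaluating this function over a range of budgets for each method would give one curve per method, which is the kind of comparison the figure reports.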