On Adversarial Examples for Text Classification by Perturbing Latent Representations
Korn Sooksatra, Bikram Khanal, Pablo Rivas
TL;DR
The paper tackles the robustness of text classifiers by shifting adversarial perturbations from discrete text edits to continuous manipulations in embedding space. It introduces an encoder–decoder framework that maps text to a latent embedding, perturbs that embedding with a white-box, FGSM-like attack, and decodes the result back to text while preserving semantics. Training the encoder–decoder jointly with a classifier on the embeddings helps organize the latent space and prevents degenerate reconstructions. Experiments on the AG News dataset show that the method yields natural-looking adversarial texts that mislead the classifier, revealing dataset-dependent vulnerabilities and motivating further work on defenses.
Abstract
Recent advances in deep learning have significantly improved many text classification applications. However, this improvement comes at a cost: deep learning models are vulnerable to adversarial examples, which indicates that they are not very robust. Fortunately, the input of a text classifier is discrete, which protects the classifier from state-of-the-art attacks that rely on continuous inputs. Nonetheless, previous works have devised black-box attacks that successfully manipulate the discrete values of the input to find adversarial examples. Therefore, instead of changing the discrete values, we transform the input into a real-valued embedding vector and apply state-of-the-art white-box attacks to it. We then convert the perturbed embedding vector back into text, which serves as the adversarial example. In summary, we create a framework that measures the robustness of a text classifier by using the gradients of the classifier.
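To make the perturbation step concrete, the following is a minimal sketch of a single FGSM-style step applied to an embedding vector rather than to the text itself. It is not the paper's implementation. The linear softmax classifier, its random weights `W` and `b`, the embedding size, and the step size `epsilon` are all assumptions chosen for illustration, and the encoder and decoder are omitted entirely: `z` stands in for the encoder's output, and `z_adv` is what a decoder would then turn back into text.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D vector of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy(z, W, b, label):
    """Cross-entropy loss of a linear softmax classifier on embedding z."""
    return -np.log(softmax(W @ z + b)[label])


def loss_grad_wrt_embedding(z, W, b, label):
    """Gradient of the cross-entropy loss with respect to the embedding z."""
    probs = softmax(W @ z + b)
    one_hot = np.zeros_like(probs)
    one_hot[label] = 1.0
    return W.T @ (probs - one_hot)


def fgsm_on_embedding(z, W, b, label, epsilon):
    """One FGSM step: move z by epsilon along the sign of the loss gradient."""
    grad = loss_grad_wrt_embedding(z, W, b, label)
    return z + epsilon * np.sign(grad)


# Hypothetical setup: 4 classes, 8-dimensional embeddings, random weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
b = np.zeros(4)
z = rng.normal(size=8)                # stand-in for an encoder output
label = int(np.argmax(W @ z + b))     # treat the clean prediction as the label
epsilon = 0.1

z_adv = fgsm_on_embedding(z, W, b, label, epsilon)
loss_clean = cross_entropy(z, W, b, label)
loss_adv = cross_entropy(z_adv, W, b, label)
```

Each coordinate of the embedding moves by at most `epsilon`, and for this convex toy classifier the loss on `z_adv` cannot be lower than the loss on `z`. Whether the predicted class actually flips depends on `epsilon` and on the classifier, and in the full framework it also depends on how faithfully the decoder maps the perturbed embedding back to text.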
