
Emoti-Attack: Zero-Perturbation Adversarial Attacks on NLP Systems via Emoji Sequences

Yangshijie Zhang

TL;DR

The paper addresses the vulnerability of NLP models to adversarial attacks by introducing Emoji-Attack, a zero-perturbation method that places emoji sequences around a text to change model outputs without altering the text itself. It formalizes the attack with an emoji prefix $s$, an emoji suffix $s'$, and the concatenated input $s \oplus x \oplus s'$, seeking $f_{\text{tgt}}(s \oplus x \oplus s') \neq f_{\text{tgt}}(x)$ subject to emotional consistency $f_{\text{sen}}(s)=f_{\text{sen}}(s')=f_{\text{sen}}(x)$ and length bounds, with the objective $\mathcal{L}(s \oplus x \oplus s', y) = \log p_{\text{tgt}}(\hat{y}|s \oplus x \oplus s') - \log p_{\text{tgt}}(y|s \oplus x \oplus s')$. The method combines a two-phase learning framework with a specialized emoji sequence generator, using a unified vocabulary $\mathcal{V}=\mathcal{V}_t\cup \mathcal{V}_e$, an Emoji Logits Processor, and a multi-term objective $\mathcal{L}=\mathcal{L}_{\text{sem}}+\lambda_1\mathcal{L}_{\text{adv}}+\lambda_2\mathcal{L}_{\text{div}}$ to balance attack efficacy and sequence naturalness. Empirical results on Go Emotion, Tweet Emoji, and multiple large language models show high attack success rates with 0% textual perturbation at low compute cost, pointing to a systemic vulnerability of contemporary NLP systems to emoji-based perturbations and motivating targeted defenses.
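The formulation in the TL;DR can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' implementation: `toy_f_sen` and `toy_p_tgt` are hypothetical stand-ins for $f_{\text{sen}}$ and $p_{\text{tgt}}$, $\hat{y}$ is assumed to be the most probable label other than the true label $y$, and `mask_to_emoji` is one plausible reading of what an Emoji Logits Processor does (restricting generation to $\mathcal{V}_e$), which the summary does not spell out.

```python
import math

def wrap(prefix, text, suffix):
    """Build s ⊕ x ⊕ s': emoji before and after, text left untouched."""
    return "".join(prefix) + text + "".join(suffix)

def margin_loss(probs, true_label):
    """log p(y_hat | input) - log p(y | input).

    Assumption: y_hat is the most probable label other than the true label y,
    so a positive value means the model no longer predicts y.
    """
    rival = max((lab for lab in probs if lab != true_label), key=probs.get)
    return math.log(probs[rival]) - math.log(probs[true_label])

def is_feasible(prefix, suffix, text, f_sen, max_len):
    """Check f_sen(s) == f_sen(s') == f_sen(x) and the length bounds."""
    target = f_sen(text)
    return (
        len(prefix) <= max_len
        and len(suffix) <= max_len
        and f_sen("".join(prefix)) == target
        and f_sen("".join(suffix)) == target
    )

def mask_to_emoji(logits, emoji_ids):
    """Keep logits at emoji token ids; set every other logit to -inf."""
    return [v if i in emoji_ids else float("-inf") for i, v in enumerate(logits)]

# --- Hypothetical toy models, for illustration only ---
def toy_f_sen(s):
    """Stand-in sentiment model: positive if a positive cue is present."""
    return "positive" if any(cue in s for cue in ("great", "😀", "🎉")) else "negative"

def toy_p_tgt(s):
    """Stand-in target model: each 🎉 moves probability from 'joy' to 'surprise'."""
    p_joy = max(0.05, 0.7 - 0.2 * s.count("🎉"))
    return {"joy": p_joy, "surprise": 1.0 - p_joy}

text = "This is great"
prefix, suffix = ["🎉"], ["🎉"]
adv = wrap(prefix, text, suffix)

clean_loss = margin_loss(toy_p_tgt(text), "joy")  # negative: 'joy' still wins
adv_loss = margin_loss(toy_p_tgt(adv), "joy")     # positive: prediction flips
feasible = is_feasible(prefix, suffix, text, toy_f_sen, max_len=3)
```

In this toy setup the text `x` is never modified, which is the sense in which the attack is zero-perturbation; only the surrounding emoji lists vary, and a candidate pair is accepted only if it passes the consistency and length check.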

Abstract

Deep neural networks (DNNs) have achieved remarkable success in the field of natural language processing (NLP), leading to widely recognized applications such as ChatGPT. However, the vulnerability of these models to adversarial attacks remains a significant concern. Unlike continuous domains like images, text exists in a discrete space, making even minor alterations at the sentence, word, or character level easily perceptible to humans. This inherent discreteness also complicates the use of conventional optimization techniques, as text is non-differentiable. Previous research on adversarial attacks in text has focused on character-level, word-level, sentence-level, and multi-level approaches, all of which suffer from inefficiency or perceptibility issues due to the need for multiple queries or significant semantic shifts. In this work, we introduce a novel adversarial attack method, Emoji-Attack, which leverages the manipulation of emojis to create subtle, yet effective, perturbations. Unlike character- and word-level strategies, Emoji-Attack targets emojis as a distinct layer of attack, resulting in less noticeable changes with minimal disruption to the text. This approach has been largely unexplored in previous research, which typically focuses on emoji insertion as an extension of character-level attacks. Our experiments demonstrate that Emoji-Attack achieves strong attack performance on both large and small models, making it a promising technique for enhancing adversarial robustness in NLP systems.

Paper Structure

This paper contains 12 sections, 20 equations, 2 tables.