Special-Character Adversarial Attacks on Open-Source Language Model

Ephraiem Sarabamoun

TL;DR

This work evaluates seven open-source LLMs (ranging from 3.8B to 32B parameters) against over 4,000 character-level adversarial attacks spanning four families (Unicode control, homoglyph, structural, and encoding). By combining quantitative semantic-similarity metrics with qualitative analyses, the study reveals pervasive vulnerabilities including jailbreaks, encoding hallucinations, and incoherent outputs across model sizes, with some models showing robust behavior against certain attack types. The authors provide comprehensive findings, release all code, datasets, and evaluation protocols, and discuss implications for safety mechanisms, tokenization pipelines, and defense design. The results underscore the need for robust preprocessing, detection, and mitigation strategies as LLM deployments expand into safety-critical domains.

Abstract

Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, yet their vulnerability to character-level adversarial manipulations presents significant security challenges for real-world deployments. This paper studies four kinds of special-character attacks aimed at bypassing safety mechanisms: Unicode, homoglyph, structural, and textual encoding attacks. We evaluate seven prominent open-source models ranging from 3.8B to 32B parameters on 4,000+ attack attempts. These experiments reveal critical vulnerabilities across all model sizes, exposing failure modes that include successful jailbreaks, incoherent outputs, and unrelated hallucinations.
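To make the four attack families concrete, the following is a minimal sketch of what a character-level perturbation in each family might look like when applied to a benign string. It is not the paper's released code: the function names, the tiny homoglyph map, the choice of the zero-width space, the newline-based structural transform, and the use of Base64 are all assumptions chosen for illustration.

```python
import base64

# Assumption: one invisible character stands in for the "Unicode control" family.
ZERO_WIDTH_SPACE = "\u200b"

# Assumption: a tiny Latin -> Cyrillic lookalike map, not the paper's table.
HOMOGLYPHS = {"a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e", "p": "\u0440"}


def unicode_control(text: str) -> str:
    """Insert an invisible zero-width space between every pair of characters."""
    return ZERO_WIDTH_SPACE.join(text)


def homoglyph(text: str) -> str:
    """Swap selected Latin letters for visually similar Cyrillic letters."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def structural(text: str) -> str:
    """Break the text layout by putting each word on its own line (hypothetical)."""
    return "\n".join(text.split(" "))


def encoding(text: str) -> str:
    """Re-express the text in a textual encoding, here Base64."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    sample = "please open the report"
    for perturb in (unicode_control, homoglyph, structural, encoding):
        # repr() makes the invisible and lookalike characters visible as escapes.
        print(f"{perturb.__name__:16s} {perturb(sample)!r}")
```

Printing with `repr()` matters here: the Unicode control and homoglyph variants render almost identically to the original on screen, which is why such edits are hard to spot by eye.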

Paper Structure

This paper contains 25 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Average semantic similarity scores for the 7 tested models on prompt set 1 at temperature 0 for (a) encoding attacks, (b) homoglyph attacks, (c) structural attacks, and (d) Unicode attacks.
  • Figure 2: Average semantic similarity scores for the 7 tested models broken down by question. Results are from the temperature = 0, prompt set 1 run. Questions are presented in the order they appear in the enumerated list in the Supporting Information.
  • Figure 3: Two examples of successful character-level attacks. (a) shows the standard response of deepseek-r1:7b to a malicious prompt, followed by the compromised response after a few Unicode characters are inserted into the prompt. (b) shows the same pattern for a successful jailbreaking attack on mistral:7b. The red text represents invisible Unicode characters added to the prompt, which would not appear in standard text rendering.
  • Figure 4: Representative examples of anomalous model behaviors. (a) The phi3:3.8b model hallucinates an incorrect decoding of the attack prompt, effectively rendering the attack inert. (b) The deepseek-r1:32b model outputs an irrelevant mathematical derivation in response to a Unicode attack. (c) The deepseek-r1:8b model produces nonsensical output following a structural attack. (d) The deepseek-r1:8b model demonstrates awareness that it is being tested.
  • Figure 5: Average semantic similarity scores for the 7 tested models on prompt set 2 at temperature 0 for (a) encoding attacks, (b) homoglyph attacks, (c) structural attacks, and (d) Unicode attacks.
  • ...and 3 more figures
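The Figure 3 caption notes that the inserted Unicode characters are invisible in standard text rendering. As a rough illustration of the kind of preprocessing check this suggests, the sketch below flags and strips Unicode format characters (general category `Cf`) and reports when a string mixes Latin and Cyrillic letters. This is an assumed, simplified defense written for this summary, not a method described in the paper; the function names are hypothetical, and a real filter would need a fuller confusables table and care for languages that legitimately use format characters.

```python
import unicodedata


def find_invisible(text: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for every format character (category Cf)."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_invisible(text: str) -> str:
    """Drop format characters such as zero-width spaces and direction overrides."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def mixes_latin_and_cyrillic(text: str) -> bool:
    """Heuristic homoglyph hint: letters from both scripts in one string."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # The script is the first word of the character name, e.g. "LATIN".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return {"LATIN", "CYRILLIC"} <= scripts


if __name__ == "__main__":
    prompt = "he\u200bllo w\u043erld"  # zero-width space plus a Cyrillic "o"
    print(find_invisible(prompt))
    print(repr(strip_invisible(prompt)))
    print(mixes_latin_and_cyrillic(prompt))
```

Stripping `Cf` characters only addresses the invisible-character case. Homoglyph substitutions survive it, which is why the script-mixing check is kept as a separate signal rather than folded into the stripping step.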