Special-Character Adversarial Attacks on Open-Source Language Model
Ephraiem Sarabamoun
TL;DR
This work evaluates seven open-source LLMs (ranging from 3.8B to 32B parameters) against over 4,000 character-level adversarial attacks spanning four families: Unicode control, homoglyph, structural, and encoding. Combining quantitative semantic-similarity metrics with qualitative analysis, the study finds vulnerabilities at every model size, including jailbreaks, encoding hallucinations, and incoherent outputs, although some models remain robust against certain attack types. The authors release all code, datasets, and evaluation protocols, and discuss implications for safety mechanisms, tokenization pipelines, and defense design. The results underscore the need for robust preprocessing, detection, and mitigation strategies as LLM deployments expand into safety-critical domains.
Abstract
Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, yet their vulnerability to character-level adversarial manipulations presents significant security challenges for real-world deployments. This paper studies four families of special-character attacks aimed at bypassing safety mechanisms: Unicode, homoglyph, structural, and textual encoding attacks. We evaluate seven prominent open-source models ranging from 3.8B to 32B parameters on 4,000+ attack attempts. These experiments reveal critical vulnerabilities across all model sizes, exposing failure modes that include successful jailbreaks, incoherent outputs, and unrelated hallucinations.
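To make the four attack families concrete, the following minimal Python sketch applies one illustrative perturbation from each family to a prompt. It is not the paper's released code: the function names, the small homoglyph map, the choice of a zero-width space for the Unicode family, the separator-based structural variant, and Base64 for the encoding family are all assumptions chosen only for illustration.

```python
import base64

# Assumption: a zero-width space stands in for the "Unicode control" family.
ZERO_WIDTH_SPACE = "\u200b"

# Assumption: a tiny, non-exhaustive Latin -> Cyrillic look-alike map.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic a
    "c": "\u0441",  # Cyrillic es
    "e": "\u0435",  # Cyrillic ie
    "o": "\u043e",  # Cyrillic o
    "p": "\u0440",  # Cyrillic er
}


def unicode_control(text: str) -> str:
    """Insert an invisible zero-width space between every pair of characters."""
    return ZERO_WIDTH_SPACE.join(text)


def homoglyph(text: str) -> str:
    """Swap selected Latin letters for visually similar Cyrillic ones."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def structural(text: str, sep: str = "-") -> str:
    """Break up the surface form by placing a visible separator between characters."""
    return sep.join(text)


def encoding(text: str) -> str:
    """Re-express the prompt as Base64 text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def perturb_all(text: str) -> dict[str, str]:
    """Return one perturbed variant of the prompt per (hypothetical) family."""
    return {
        "unicode_control": unicode_control(text),
        "homoglyph": homoglyph(text),
        "structural": structural(text),
        "encoding": encoding(text),
    }


if __name__ == "__main__":
    for family, variant in perturb_all("open the pod bay doors").items():
        # repr() makes invisible and look-alike characters visible as escapes.
        print(f"{family}: {variant!r}")
```

In each case the text a human reads stays the same or is trivially recoverable, while the character sequence handed to the tokenizer changes, which is why the tokenization pipeline is a natural place to look for both the weakness and the defense.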
