Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models
Haoyu Liang, Youran Sun, Yunfeng Cai, Jun Zhu, Bo Zhang
TL;DR
This work shows that the outputs of text embedding models used in LLM safeguards concentrate in a biased region of the unit sphere, and that universal magic words can steer these embeddings to manipulate text similarity. It introduces three search strategies (Brute-Force, Context-Free, and Gradient-Based) to efficiently discover single- and multi-token suffixes, and demonstrates jailbreaks on JailbreakBench and real-world chatbots, with transfer across models and languages. The study also provides train-free defenses, notably renormalization, vocabulary cleaning, and reinitialization, that mitigate these attacks while improving downstream performance. The findings highlight a practical security risk in embedding-based safeguards and offer concrete mitigation strategies.
Abstract
The security of large language models (LLMs) has gained wide attention recently, and various defense mechanisms have been developed to prevent harmful output; among these, safeguards based on text embedding models serve as a fundamental defense. Through testing, we discover that the output distribution of text embedding models is severely biased, with a large mean. Inspired by this observation, we propose novel, efficient methods to search for **universal magic words** that attack text embedding models. Used as suffixes, universal magic words can shift the embedding of any text towards the bias direction, thereby manipulating the similarity of any text pair and misleading safeguards. Attackers can jailbreak the safeguards by appending magic words to user prompts and instructing LLMs to end their answers with magic words. Experiments show that magic word attacks significantly degrade safeguard performance on JailbreakBench, cause real-world chatbots to produce harmful outputs in full-pipeline attacks, and generalize across input/output texts, models, and languages. To mitigate this security risk, we also propose defense methods against such attacks, which correct the bias of text embeddings and improve downstream performance in a train-free manner.
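The geometric intuition behind the abstract can be illustrated with a small synthetic sketch. It does not use a real embedding model or the authors' search methods: the "embeddings" are random unit vectors with a shared offset, the effect of a magic-word suffix is modeled as a fixed push along the mean direction (the hypothetical parameter `alpha`), and the defense is one plausible reading of renormalization (subtract the estimated mean, then rescale to unit length). All names and constants here are assumptions chosen for illustration.

```python
import numpy as np

def normalize(x):
    """Scale each row vector to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mean_cosine(a, b):
    """Average cosine similarity between row-aligned unit vectors."""
    return float(np.mean(np.sum(a * b, axis=-1)))

rng = np.random.default_rng(0)
dim, n = 64, 1000

# Synthetic stand-in for a biased embedding model (assumption):
# every embedding is a shared bias vector plus noise of similar norm.
bias = normalize(rng.normal(size=dim))
noise = rng.normal(size=(n, dim)) / np.sqrt(dim)
emb = normalize(bias + noise)

# A "large mean": the average embedding is far from the origin.
mean_vec = emb.mean(axis=0)
mean_norm = float(np.linalg.norm(mean_vec))
mean_dir = mean_vec / mean_norm

# Two sets of unrelated texts, e.g. user inputs and reference texts.
emb_a, emb_b = emb[: n // 2], emb[n // 2 :]
base_sim = mean_cosine(emb_a, emb_b)

# Toy model of a magic-word suffix (assumption): it pushes every
# embedding by the same amount alpha along the mean direction.
alpha = 1.0
shifted_a = normalize(emb_a + alpha * mean_dir)
shifted_sim = mean_cosine(shifted_a, emb_b)

# One plausible train-free correction (assumption): subtract the
# estimated mean and rescale to unit length.
centered = normalize(emb - mean_vec)
centered_sim = mean_cosine(centered[: n // 2], centered[n // 2 :])

print(f"norm of mean embedding:        {mean_norm:.3f}")
print(f"similarity, unrelated pairs:   {base_sim:.3f}")
print(f"similarity after suffix shift: {shifted_sim:.3f}")
print(f"similarity after centering:    {centered_sim:.3f}")
```

In this toy setting, unrelated texts already look similar because they share the bias, pushing one side further along the mean direction raises the similarity of every pair at once, and centering removes most of the shared component. A safeguard that thresholds cosine similarity would be sensitive to exactly this kind of uniform shift.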
