Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, Daniele Nardi
TL;DR
The paper investigates whether adversarial poetry can jailbreak large language models universally and in a single turn. It uses a two-layer evaluation: hand-crafted adversarial poems, and a standardized poetic transformation of 1,200 MLCommons prompts, both assessed by an ensemble of three open-weight judge models with human validation. Hand-crafted poems reach an average attack success rate of about 62% across 25 models (about 43% for the meta-prompt conversions), and vulnerability across providers and risk domains points to a systemic weakness to stylistic shifts. The results challenge current alignment and benchmarking practices by showing that surface-form variation alone can undermine safety, which motivates stress tests with stylistic perturbations and deeper mechanistic analyses to guide robust safety frameworks.
Abstract
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated using an ensemble of three open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
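To make the ASR metric concrete, the following is a minimal sketch of how an attack-success rate could be computed from three binary judge verdicts per response. The abstract does not state how the ensemble's verdicts are combined, so the majority-vote rule here is an assumption, and the function names and vote data are hypothetical illustrations rather than results from the paper.

```python
from typing import Sequence


def majority_unsafe(votes: Sequence[bool]) -> bool:
    """Return True when more than half of the judges flag a response as unsafe.

    Assumption: the ensemble is aggregated by simple majority vote.
    """
    return sum(votes) * 2 > len(votes)


def attack_success_rate(judgments: Sequence[Sequence[bool]]) -> float:
    """Fraction of responses that the ensemble labels unsafe."""
    if not judgments:
        raise ValueError("no judgments supplied")
    return sum(majority_unsafe(v) for v in judgments) / len(judgments)


# Hypothetical verdicts (True = unsafe) from three judges on four responses each.
poetic = [(True, True, False), (True, True, True), (False, False, True), (True, False, True)]
prose = [(False, False, False), (True, True, False), (False, False, True), (False, False, False)]

poetic_asr = attack_success_rate(poetic)
prose_asr = attack_success_rate(prose)
ratio = poetic_asr / prose_asr  # how many times higher the poetic ASR is

print(f"poetic ASR = {poetic_asr:.2f}, prose ASR = {prose_asr:.2f}, ratio = {ratio:.1f}x")
```

A relative figure such as "18 times higher" corresponds to the `ratio` value in this sketch: the poetic ASR divided by the prose-baseline ASR for the same prompts.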
