Jailbreaking Large Language Models with Symbolic Mathematics

Emet Bethany; Mazal Bethany; Juan Arturo Nolazco Flores; Sumit Kumar Jha; Peyman Najafirad

TL;DR

This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts as mathematical problems, it demonstrates a critical vulnerability in current AI safety measures.

Abstract

Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack's success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.
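One way to make the "semantic shift" claim concrete is to compare the embedding of each original prompt with the embedding of its encoded counterpart. The sketch below is illustrative only: the abstract does not say which similarity measure the authors used, so cosine similarity is an assumption here, and the function names and toy vectors are hypothetical placeholders rather than real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(originals, encoded):
    """Average cosine similarity over matched (original, encoded) pairs."""
    if not originals or len(originals) != len(encoded):
        raise ValueError("need two non-empty lists of equal length")
    sims = [cosine_similarity(o, e) for o, e in zip(originals, encoded)]
    return sum(sims) / len(sims)


# Toy placeholder vectors (assumption: stand-ins for real prompt embeddings).
original_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
encoded_vecs = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

shift_score = mean_pairwise_similarity(original_vecs, encoded_vecs)
```

A low mean similarity under this measure would be consistent with a large shift between the two prompt forms; it says nothing by itself about why a given model responds the way it does.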

Paper Structure (17 sections, 16 equations, 6 figures, 1 table)

Introduction
Related Work
Methodology
Generating MathPrompt attacks
Experiments
Experimental setup
Safety training and alignment do not generalize to mathematics-based attacks
Conclusion
Limitations
Social impacts statement
Implementation Details
Hardware requirements and usage
System prompt for MathPrompt generator LLM
Few-shot demonstrations for MathPrompt generator LLM
Prepended instructions to MathPrompt attacks
...and 2 more sections

Figures (6)

Figure 1: MathPrompt jailbreaks state-of-the-art LLMs by transforming harmful natural language prompts into mathematics problems, which are generated by an LLM with few-shot demonstrations.
Figure 2: t-SNE visualization of embedding vectors for original (blue) and math (orange) prompts
Figure 3: System prompt for GPT-4o when generating MathPrompt attacks
Figure 4: First few-shot demonstration for GPT-4o when generating MathPrompt attacks
Figure 5: Second few-shot demonstration for GPT-4o when generating MathPrompt attacks
...and 1 more figure
