A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue

TL;DR

This study investigates multilingual jailbreaking of large language models by creating a semantic-preserving multilingual dataset and evaluating cross-language defenses across multiple models, including GPT-4 and Vicuna variants. It combines an interpretability analysis using attention and representation methods with a LoRA-based fine-tuning mitigation that achieves a 96.2% reduction in attack success rate. Key findings show that model version and language resource level influence jailbreak success, while prompt templates generally degrade defenses across models. The work contributes a scalable multilingual benchmarking framework and practical mitigation insights to enhance defensive capabilities in globally deployed LLMs.

Abstract

Large Language Models (LLMs) have become increasingly popular for their advanced text generation capabilities across various domains. However, like any software, they face security challenges, including the risk of 'jailbreak' attacks that manipulate LLMs to produce prohibited content. A particularly underexplored area is the Multilingual Jailbreak attack, where malicious questions are translated into various languages to evade safety filters. Currently, there is a lack of comprehensive empirical studies addressing this specific threat. To address this research gap, we conducted an extensive empirical study on Multilingual Jailbreak attacks. We developed a novel semantic-preserving algorithm to create a multilingual jailbreak dataset and conducted an exhaustive evaluation on both widely-used open-source and commercial LLMs, including GPT-4 and LLaMa. Additionally, we performed interpretability analysis to uncover patterns in Multilingual Jailbreak attacks and implemented a fine-tuning mitigation method. Our findings reveal that our mitigation strategy significantly enhances model defense, reducing the attack success rate by 96.2%. This study provides valuable insights into understanding and mitigating Multilingual Jailbreak attacks.
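
To make the headline metric concrete, the following is a minimal sketch of how an attack success rate (ASR) and its reduction after mitigation could be computed. It assumes that ASR is the fraction of prompts whose attack succeeds and that the reported reduction is relative to the pre-mitigation ASR; the abstract does not spell out either definition. The function names and the outcome lists are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of prompts for which the attack succeeded (True = jailbroken)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def relative_reduction(asr_before: float, asr_after: float) -> float:
    """Relative drop in ASR, expressed as a fraction of the pre-mitigation ASR."""
    if asr_before <= 0:
        raise ValueError("asr_before must be positive")
    return (asr_before - asr_after) / asr_before


# Hypothetical outcomes, invented purely for illustration (not the paper's data).
before = [True] * 50 + [False] * 50  # 50 of 100 attacks succeed before mitigation
after = [True] * 5 + [False] * 95    # 5 of 100 attacks succeed after mitigation

asr_before = attack_success_rate(before)
asr_after = attack_success_rate(after)
reduction = relative_reduction(asr_before, asr_after)
print(f"ASR before: {asr_before:.1%}, after: {asr_after:.1%}, reduction: {reduction:.1%}")
```

Under an absolute-difference reading, the reduction would instead be `asr_before - asr_after`. The two readings can differ substantially, so it is worth checking which one a reported figure uses.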

Paper Structure (28 sections, 6 equations, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: Examples of a jailbreak prompt that combines a jailbreak template with a malicious question, and of a jailbreak prompt that contains the malicious question only. Both kinds of jailbreak prompt are used in our experiments.
  • Figure 2: Example of a multilingual LLM jailbreak. The original English prompt is identified by the LLM, but it bypasses the safety mechanism when translated into Spanish.
  • Figure 3: Workflow of our study, comprising multilingual dataset construction, multilingual LLM jailbreak evaluation, interpretability analysis, and jailbreak mitigation.
  • Figure 4: Attack Success Rate of LLMs with questions excluding jailbreak templates.
  • Figure 5: Attack Success Rate of LLMs with questions including jailbreak templates.
  • ...and 1 more figure