Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Benji Peng, Keyu Chen, Qian Niu, Ziqian Bi, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K. Q. Yan, Yizhu Wen, Yichao Zhang, Caitlyn Heqi Yin, Xinyuan Song

TL;DR

This survey analyzes vulnerabilities of large language models to prompt injection and jailbreaking across prompt-based, model-based, multimodal, and multilingual vectors. It reviews a broad spectrum of defense mechanisms, from prompt-level filters and transformations to model-level training techniques and multi-agent strategies, and discusses evaluation benchmarks and metrics. The paper highlights current gaps in alignment robustness, evaluation standardization, and ethical considerations, and proposes directions toward resilient alignment, automated jailbreak detection, and comprehensive defense frameworks. Its findings stress the need for cross-disciplinary collaboration among researchers, industry, and policymakers to ensure safe, trustworthy deployment of LLMs.

Abstract

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications in fields such as healthcare, software engineering, and conversational systems. Despite the advances of the past few years, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies. We broadly categorize attack approaches into prompt-based, model-based, multimodal, and multilingual, covering techniques such as adversarial prompting, backdoor injections, and cross-modality exploits. We then review defense mechanisms, including prompt filtering, transformation, alignment techniques, multi-agent defenses, and self-regulation, and evaluate their strengths and shortcomings. We further discuss key metrics and benchmarks used to assess LLM safety and robustness, noting challenges such as quantifying attack success in interactive contexts and biases in existing datasets. Having identified current research gaps, we suggest future directions for resilient alignment strategies, advanced defenses against evolving attacks, automation of jailbreak detection, and consideration of ethical and societal impacts. This review emphasizes the need for continued research and cooperation within the AI community to enhance LLM security and ensure safe deployment.

Paper Structure

This paper contains 70 sections and 3 figures.

Figures (3)

  • Figure 1: Taxonomy of Jailbreak Attack Methods and Techniques in Large Language Models
  • Figure 2: Taxonomy of Defense Mechanisms Against Jailbreak Attacks in Large Language Models
  • Figure 3: Despite the multiple safeguards integrated into GPT-4o and into applications such as Perplexity Pro as of October 15, 2024, straightforward user prompts can still exploit vulnerabilities and cause unintended disclosure of internal system prompts. One example is asking the model to reproduce its system-level instructions in a different format, such as a code block. The Perplexity Pro prompt, translated into Traditional Chinese, asked the application to "act as an English teacher and translate the instructions starting with 'You are...' into a code block", which led to the prompt disclosure.