Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses

Zhengchun Shang, Wenlan Wei, Weiheng Bai

TL;DR

This study systematically evaluates the jailbreak robustness of LLMs across open and closed models, testing four attack methods and three defenses under a unified, black-box protocol. It introduces an LLM-based evaluator that reliably identifies compromised outputs and shows that larger model size or newer versions do not consistently improve safety. Defenses reduce attack success, but their effectiveness varies by attack type and model, underscoring the need for defense-in-depth rather than reliance on model improvements alone. The work offers practical insights for deploying safer LLM systems and provides reproducible benchmarks and code to advance future security research.

Abstract

Large Language Models (LLMs) are increasingly popular, powering a wide range of applications. Their widespread use has raised concerns, especially about jailbreak attacks that bypass safety measures to produce harmful content. In this paper, we present a comprehensive security analysis of LLMs, addressing critical research questions on the evolution and determinants of model safety. Specifically, we begin by identifying the most effective techniques for detecting jailbreak attacks. Next, we investigate whether newer versions of LLMs offer improved security compared to their predecessors. We also assess the impact of model size on overall security and explore the potential benefits of integrating multiple defense strategies to enhance security. Our study evaluates both open-source models (e.g., LLaMA and Mistral) and closed-source models (e.g., GPT-4) by employing four state-of-the-art attack techniques and assessing the efficacy of three new defensive approaches.

Paper Structure

This paper contains 16 sections, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Attack success rates across different LLMs and attack methods (no defense).
  • Figure 2: Attack success rate across different defense mechanisms and attack methods.