Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses
Zhengchun Shang, Wenlan Wei, Weiheng Bai
TL;DR
This study systematically evaluates the jailbreak robustness of LLMs across open-source and closed-source models, testing four attack methods and three defenses under a unified, black-box protocol. It introduces an LLM-based evaluator that reliably identifies compromised outputs, and it finds that larger model size or newer versions do not consistently improve safety. Defenses reduce attack success, but their effectiveness varies by attack type and model, underscoring the need for defense-in-depth rather than reliance on model improvements alone. The work offers practical insights for deploying safer LLM systems and provides reproducible benchmarks and code to advance future security research.
Abstract
Large Language Models (LLMs) are increasingly popular, powering a wide range of applications. Their widespread use has raised security concerns, particularly about jailbreak attacks that bypass safety measures to elicit harmful content. In this paper, we present a comprehensive security analysis of LLMs, addressing critical research questions on the evolution and determinants of model safety. Specifically, we begin by identifying the most effective techniques for detecting jailbreak attacks. Next, we investigate whether newer versions of LLMs offer improved security compared to their predecessors. We also assess the impact of model size on overall security and explore the potential benefits of integrating multiple defense strategies. Our study evaluates both open-source models (e.g., LLaMA and Mistral) and closed-source models (e.g., GPT-4), employing four state-of-the-art attack techniques and assessing the efficacy of three new defensive approaches.
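As an illustration of how a comparison across models, attacks, and defenses might be tabulated, the following minimal sketch computes an attack success rate for each (model, attack, defense) combination. It is not the authors' released code: the record format, the field names (`model`, `attack`, `defense`, `jailbroken`), and the placeholder labels in the usage example are assumptions made for this sketch, and the `jailbroken` flag is assumed to come from some evaluator that has already judged each response.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return the fraction of judged-jailbroken responses per configuration.

    `records` is an iterable of dicts with the hypothetical keys
    "model", "attack", "defense", and "jailbroken" (a bool supplied
    by an evaluator). This layout is an assumption for illustration.
    """
    totals = defaultdict(int)  # number of attempts per configuration
    hits = defaultdict(int)    # number of successful jailbreaks per configuration
    for r in records:
        key = (r["model"], r["attack"], r["defense"])
        totals[key] += 1
        hits[key] += 1 if r["jailbroken"] else 0
    return {key: hits[key] / totals[key] for key in totals}


# Placeholder data: the labels below are invented, not results from the study.
example_records = [
    {"model": "model-a", "attack": "attack-1", "defense": "none", "jailbroken": True},
    {"model": "model-a", "attack": "attack-1", "defense": "none", "jailbroken": False},
    {"model": "model-a", "attack": "attack-1", "defense": "defense-1", "jailbroken": False},
    {"model": "model-a", "attack": "attack-1", "defense": "defense-1", "jailbroken": False},
]
rates = attack_success_rate(example_records)
```

Keying the result by the full (model, attack, defense) triple keeps each configuration separate, which matters when, as the summary above notes, the effect of a defense varies by attack type and model.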
