Is the System Message Really Important to Jailbreaks in Large Language Models?

Xiaotian Zou; Yongkang Chen; Ke Li

TL;DR

The paper investigates whether system messages influence jailbreak vulnerabilities in large language models and finds significant effects across multiple LLMs. It demonstrates that different system-message configurations can dramatically alter jailbreak success, and introduces the System Messages Evolutionary Algorithm (SMEA) to automatically generate robust, diverse system messages with minimal length changes. Empirical results show substantial reductions in attack success rates for several models when using optimized system messages, though some models (notably VICUNA) remain comparatively vulnerable. The work highlights system-message design as a practical lever for LLM security and presents a scalable approach to hardening models against jailbreak prompts.

Abstract

The rapid evolution of Large Language Models (LLMs) has rendered them indispensable in modern society. While security measures are typically taken to align LLMs with human values prior to release, recent studies have unveiled a concerning phenomenon named "Jailbreak". This term refers to the unexpected and potentially harmful responses generated by LLMs when prompted with malicious questions. Most existing research focuses on generating jailbreak prompts, yet system message configurations vary significantly across experiments. In this paper, we aim to answer one question: is the system message really important for jailbreaks in LLMs? We conduct experiments on mainstream LLMs, generating jailbreak prompts under three system message settings: short, long, and none. We find that different system messages offer different levels of resistance to jailbreaks. We therefore explore the transferability of jailbreaks across LLMs with different system messages. Furthermore, we propose the System Messages Evolutionary Algorithm (SMEA) to generate system messages that are more resistant to jailbreak prompts, even with minor changes. Through SMEA, we obtain a robust population of system messages with little change in message length. Our research not only bolsters LLM security but also raises the bar for jailbreaks, fostering advancements in this field of study.
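To make the idea of evolving system messages toward a lower attack success rate (ASR) concrete, the following is a minimal Python sketch. It is not the paper's SMEA implementation: the abstract does not specify the generation operators or the jailbreak judge, so `toy_asr`, `mutate`, `SAFE`, and `VOCAB` are hypothetical stand-ins chosen only so the example runs offline. In practice, fitness would presumably be measured by querying an LLM with jailbreak prompts and scoring its responses.

```python
import random

def attack_success_rate(responses, is_jailbroken):
    """ASR: fraction of responses judged harmful. Lower is better for the defender."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if is_jailbroken(r)) / len(responses)

# Hypothetical vocabulary for the toy example (not from the paper).
SAFE = {"safe", "refuse", "harmless", "careful"}
VOCAB = sorted(SAFE | {"helpful", "assistant", "friendly", "concise"})

def toy_asr(message):
    """Stand-in fitness: pretends that more safety-related words mean lower ASR."""
    words = message.split()
    return 1 - sum(w in SAFE for w in words) / len(words)

def mutate(message, rng):
    """Swap one word for another, so the message length in words is unchanged."""
    words = message.split()
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return " ".join(words)

def evolve(population, fitness, mutate_fn, generations, rng):
    """Generic elitist loop: keep the better half, refill with mutated survivors."""
    size = len(population)
    for _ in range(generations):
        ranked = sorted(population, key=fitness)  # ascending ASR
        survivors = ranked[: max(1, size // 2)]
        children = [mutate_fn(rng.choice(survivors), rng)
                    for _ in range(size - len(survivors))]
        population = survivors + children
    return sorted(population, key=fitness)

initial = [
    "helpful friendly concise assistant",
    "helpful careful concise assistant",
    "friendly friendly helpful assistant",
    "concise helpful friendly assistant",
]
final = evolve(initial, toy_asr, mutate, generations=20, rng=random.Random(0))
print(final[0], toy_asr(final[0]))
```

Because the best half of each generation is carried over unchanged, the best ASR in the population can never get worse from one generation to the next, and the word-swap mutation keeps every message at its original word count. That mirrors the abstract's goal of robustness with little change in message length, although here length is held fixed by construction rather than achieved by the paper's operators.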

Paper Structure (22 sections, 4 figures, 14 tables, 1 algorithm)

Introduction
Background Knowledge
Large Language Models
Jailbreak Prompts
Evolutionary Algorithms
Models, Datasets and Evaluation
Models and Datasets
Evaluation
Jailbreak experiments with different system messages
System Messages Evolutionary Algorithm
Observation of ASR in Synonymous System Messages
SMEA Framework
Generation Operators
Experiments
Conclusion
...and 7 more sections

Figures (4)

Figure 1: Examples of various interactions between the user and ChatGPT. In these examples, the content of the green border represents the user's prompt. The user accesses ChatGPT using the prompt containing solely the harmful question or through carefully crafted prompts. Within the user's inquiries, the portions with malicious intent are indicated in red font, while the sections of normal queries are in black font. Lastly, the pink-filled boxes denote instances where ChatGPT responds with harmful content, signifying a successful jailbreak.
Figure 2: The main framework of SMEA. The details of the workflow are explained in the SMEA Framework section.
Figure 3: The ASR of LLMs in final populations. In this figure, we present the performance on the final populations obtained from various generative methods across different LLMs.
Figure 4: The evolutionary trajectory of VICUNA (7b, 13b) with different generation methods. The median of the population performance is shown in orange. In these figures, a lower ASR indicates better performance.
