MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds

Xiaolong Jin, Zhuo Zhang, Xiangyu Zhang

TL;DR

The paper tackles the problem of context-dependent LLM alignment by introducing MultiVerse, a pipeline that automatically constructs nested multi-world prompts using a World Description Language (WDL) and a DSL-based compiler. By systematically varying scenario, time, location, language, and nesting depth, it reveals widespread alignment vulnerabilities across both open-source and closed-source LLMs, achieving high jailbreak success rates (above 85% on large models and near 100% on small ones). Key contributions include the WDL framework, the compiler for embedding jailbreak directives, the parameter updater for context augmentation, and a comprehensive evaluation across multiple datasets that demonstrates the limitations of current defenses like perplexity filtering and moderation. The work underscores the need to broaden alignment training to virtual and nested contexts and provides a scalable red-teaming methodology with practical implications for safer LLM deployment.

Abstract

Large Language Model (LLM) alignment aims to ensure that LLM outputs match human values. Researchers have demonstrated the severity of alignment problems with a wide spectrum of jailbreak techniques that can induce LLMs to produce malicious content during conversations. Finding the corresponding jailbreaking prompts usually requires substantial human intelligence or computational resources. In this paper, we report that LLMs have different levels of alignment in different contexts. As such, by systematically constructing many contexts, called worlds, leveraging a Domain Specific Language that describes possible worlds (e.g., time, location, characters, actions and languages) and the corresponding compiler, we can cost-effectively expose latent alignment issues. Given the low cost of our method, we are able to conduct a large-scale study of LLM alignment issues in different worlds. Our results show that our method outperforms state-of-the-art jailbreaking techniques in both effectiveness and efficiency. In addition, our results indicate that existing LLMs are extremely vulnerable to nesting worlds and programming language worlds. These results imply that existing alignment training focuses on the real world and is lacking in various (virtual) worlds where LLMs can be exploited.

Paper Structure (32 sections, 26 figures, 4 tables, 1 algorithm)

Figures (26)

  • Figure 1: Examples of MultiVerse jailbreak. LLM alignment is context-sensitive: the level of protection varies with the conversation context. LLMs are jailbroken successfully by prompts that combine specific worlds.
  • Figure 2: Domain-specific language for describing the universe of multiple worlds
  • Figure 3: WDL example
  • Figure 4: The compilation result of WDL configuration
  • Figure 5: Overview of MultiVerse. MultiVerse starts by selecting a configuration of world(s). The compiler then processes the malicious question and the world parameters to generate jailbreak prompts. If the jailbreak fails, MultiVerse updates the WDL configuration and regenerates the prompt.
  • ...and 21 more figures
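
The control flow described in the Figure 5 caption (select a configuration, compile, query, and retry with an updated configuration on failure) can be sketched as a small red-teaming harness. This is a minimal illustration under stated assumptions, not the authors' implementation: `WorldConfig`, `update_config`, and `run_loop` are hypothetical names, the "update" step here only increases nesting depth, and the compiler, the target model, and the refusal check are passed in as callables because this page does not specify them.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class WorldConfig:
    """Hypothetical stand-in for a WDL configuration (fields named in the TL;DR)."""
    scenario: str
    time: str
    location: str
    language: str
    depth: int = 1  # number of nested worlds


def update_config(cfg: WorldConfig, max_depth: int) -> Optional[WorldConfig]:
    """Toy parameter updater (assumption): add one nesting level, stop at max_depth."""
    if cfg.depth >= max_depth:
        return None
    return replace(cfg, depth=cfg.depth + 1)


def run_loop(
    question: str,
    cfg: WorldConfig,
    compile_prompt: Callable[[str, WorldConfig], str],  # assumed compiler interface
    query_model: Callable[[str], str],                  # assumed target-model interface
    is_refusal: Callable[[str], bool],                  # assumed evaluation check
    max_depth: int = 3,
) -> Tuple[Optional[WorldConfig], int]:
    """Return the first configuration the target did not refuse, and the attempt count."""
    attempts = 0
    current: Optional[WorldConfig] = cfg
    while current is not None:
        attempts += 1
        prompt = compile_prompt(question, current)
        response = query_model(prompt)
        if not is_refusal(response):
            return current, attempts
        # The attempt failed: update the configuration and regenerate.
        current = update_config(current, max_depth)
    return None, attempts
```

Keeping the compiler, model, and evaluator as injected callables lets the same loop be exercised with harmless stubs, which is also how a defense could be tested against it offline.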