MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds
Xiaolong Jin, Zhuo Zhang, Xiangyu Zhang
TL;DR
The paper tackles the problem of context-dependent LLM alignment by introducing MultiVerse, a pipeline that automatically constructs nested multi-world prompts using a World Description Language (WDL) and a DSL-based compiler. By systematically varying scenario, time, location, language, and nesting depth, it reveals widespread alignment vulnerabilities across both open-source and closed-source LLMs, achieving high jailbreak success rates (above 85% on large models and near 100% on small ones). Key contributions include the WDL framework, the compiler for embedding jailbreak directives, the parameter updater for context augmentation, and a comprehensive evaluation across multiple datasets that demonstrates the limitations of current defenses like perplexity filtering and moderation. The work underscores the need to broaden alignment training to virtual and nested contexts and provides a scalable red-teaming methodology with practical implications for safer LLM deployment.
Abstract
Large Language Model (LLM) alignment aims to ensure that LLM outputs are consistent with human values. Researchers have demonstrated the severity of alignment problems with a wide range of jailbreak techniques that can induce LLMs to produce malicious content during conversations. Finding such jailbreaking prompts usually requires substantial human intelligence or computational resources. In this paper, we report that LLMs exhibit different levels of alignment in different contexts. As such, by systematically constructing many contexts, called worlds, using a Domain Specific Language that describes possible worlds (e.g., time, location, characters, actions, and languages) together with its compiler, we can cost-effectively expose latent alignment issues. Given the low cost of our method, we are able to conduct a large-scale study of LLM alignment issues in different worlds. Our results show that our method outperforms state-of-the-art jailbreaking techniques in both effectiveness and efficiency. In addition, our results indicate that existing LLMs are extremely vulnerable to nested worlds and programming language worlds. These findings imply that existing alignment training focuses on the real world and falls short in various (virtual) worlds, where LLMs can be exploited.
