ShallowJail: Steering Jailbreaks against Large Language Models

Shang Liu, Hanyu Pei, Zeyan Liu

TL;DR

ShallowJail shows, by steering initial tokens with a task-agnostic activation vector, that safety alignment in LLMs is highly vulnerable during the first generation steps. The two-stage approach constructs a steering vector $\hat{s}$ from compliant and refusal prefixes and then perturbs hidden states during generation, with a shallow-token phase ($k \le \tau$) followed by a deep-token phase ($k > \tau$). Across multiple datasets and victim models, ShallowJail achieves high attack success rates (e.g., up to $0.9702$) while maintaining linguistic quality, exposing a critical blind spot in current alignment defenses. The work motivates defense mechanisms that persist throughout generation and provides a foundation for future adaptive steering strategies and open-source evaluation tooling, with code to be released after acceptance.

Abstract

Large Language Models (LLMs) have been successful in numerous fields. Alignment is usually applied to prevent them from being used for harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, relying on carefully crafted but unstealthy prompts, or white-box, requiring resource-intensive computation. In light of these challenges, we introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Through extensive experiments, we demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLM responses.

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Standard LLM responses to normal queries versus responses elicited through jailbreaks. A user can manipulate the LLM into producing a malicious response.
  • Figure 2: The two-stage ShallowJail framework.
  • Figure 3: Trade-off analysis on AdvBench. We observe that $\alpha$, $\beta$, and $\tau$ affect attack effectiveness.
  • Figure 4: Ablation study on Qwen2.5-7B-Instruct.