Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

Shuyi Lin, Tian Lu, Zikai Wang, Bo Wen, Yibo Zhao, Cheng Tan

TL;DR

The paper investigates security vulnerabilities in GPT-OSS-20B, an open-weight LLM with explicit chain-of-thought (CoT) reasoning. It uses the Jailbreak Oracle (JO) to systematically explore the token decoding tree and uncover hidden behaviors. It identifies five failure modes, each enabling novel jailbreaks or harmful outputs: Quant Fever, Reasoning Blackholes, Schrödinger's Compliance, Reasoning Procedure Mirage, and Chain-Oriented Prompting (COP). Jailbreak rates rise from a baseline of 3.3% to 44.4% under policy conflicts and from 28.4% to 55.3% for CoT-based triggers, and COP achieves 70–80% success in multi-step prompts. The work underscores the need for defenses that address not only isolated prompts but also compositional reasoning and procedural scaffolding in edge-deployed LLMs.

Abstract

OpenAI's GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and a Harmony prompt format. We summarize an extensive security evaluation of GPT-OSS-20B that probes the model's behavior under different adversarial conditions. Using the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes including quant fever, reasoning blackholes, Schrodinger's compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments demonstrate how these behaviors can be exploited on the GPT-OSS-20B model, leading to severe consequences.

Paper Structure

This paper contains 12 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: The Jailbreak Oracle (JO) explores the token tree to identify candidate responses with high probability. JO enables systematic discovery of hidden behaviors in GPT-OSS-20B.
  • Figure 2: Percentage of jailbroken answers, as judged by StrongReject (Souly et al., 2024), across different attack methods. The JO search consistently increases the probability of discovering a jailbroken response. *: This JO evaluation was not completed because of time constraints.
  • Figure 3: File-management results demonstrating quant fever. GPT-OSS-20B tends to delete a fixed fraction of files to meet the numerical target.
  • Figure 4: Probability of the top token during decoding. Beyond about 100 tokens, the moving average approaches 100%, indicating a reasoning blackhole.
  • Figure 5: Comparison of attention heatmaps: (a) normal output vs. (b) reasoning blackhole.
  • ...and 2 more figures
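
The Figure 4 caption describes a moving average of the top-token probability that saturates near 100% once decoding enters a reasoning blackhole. A minimal sketch of that kind of diagnostic follows. It assumes the per-step top-token probabilities have already been extracted from the decoder. The function names, the window size of 20, and the 0.99 threshold are illustrative assumptions, not values taken from the paper.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average; early positions average over the values seen so far."""
    averages = []
    buffer = deque()
    total = 0.0
    for value in values:
        buffer.append(value)
        total += value
        if len(buffer) > window:
            total -= buffer.popleft()
        averages.append(total / len(buffer))
    return averages


def first_saturated_index(top_probs, window=20, threshold=0.99):
    """Return the first decoding step whose full-window trailing average of
    top-token probability reaches `threshold`, or None if it never does.

    `window` and `threshold` are hypothetical defaults chosen for illustration.
    """
    for step, average in enumerate(moving_average(top_probs, window)):
        # Only trust the average once a full window of steps is available.
        if step + 1 >= window and average >= threshold:
            return step
    return None


# Synthetic example: 100 uncertain steps followed by 50 near-deterministic ones.
synthetic_probs = [0.5] * 100 + [1.0] * 50
saturation_step = first_saturated_index(synthetic_probs)
```

On real data, `top_probs` would be the maximum softmax probability at each decoding step; a saturation step that appears and never recovers would be consistent with the looping behavior the caption attributes to reasoning blackholes.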