Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B
Shuyi Lin, Tian Lu, Zikai Wang, Bo Wen, Yibo Zhao, Cheng Tan
TL;DR
The paper investigates security vulnerabilities in GPT-OSS-20B, an open-weight LLM with explicit chain-of-thought (CoT) reasoning. It uses the Jailbreak Oracle (JO) to systematically explore the token decoding tree and uncover hidden behaviors. Five failure modes are identified: Quant Fever, Reasoning Blackholes, Schrödinger's Compliance, Reasoning Procedure Mirage, and Chain-Oriented Prompting (COP). Each enables novel jailbreaks or harmful outputs. Jailbreak rates rise from a baseline of 3.3% to 44.4% under policy conflicts and from 28.4% to 55.3% for CoT-based triggers, with COP achieving 70–80% success in multi-step prompts. The work underscores the need for defenses that address not only isolated prompts but also compositional reasoning and procedural scaffolding in edge-deployed LLMs.
Abstract
OpenAI's GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and a Harmony prompt format. We summarize an extensive security evaluation of GPT-OSS-20B that probes the model's behavior under different adversarial conditions. Using the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes, including quant fever, reasoning blackholes, Schrödinger's compliance, reasoning procedure mirage, and chain-oriented prompting (COP). Experiments demonstrate how these behaviors can be exploited on the GPT-OSS-20B model, leading to severe consequences.
