
Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, Jerry Wei

Abstract

Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.

Paper Structure

This paper contains 60 sections, 10 figures, and 7 tables.

Figures (10)

  • Figure 1: Overview of Trojan-Speak. Top: Our training pipeline uses curriculum learning (Stage 1: teaching, Stage 2: STEM tasks) followed by hybrid RL+SFT to prevent collapse. Bottom left: what the classifier sees (encoded content disguised as forensic log analysis). Bottom right: the actual communication (harmful queries and detailed responses that evade classifier detection while providing expert-level CBRN information).
  • Figure 2: Classifier bypass rate on harmful CBRN queries vs. number of substituted letters. Frequency-ordered substitution (targeting the highest-frequency letters first) achieves a consistent 99+% bypass rate with 7+ substitutions, while random letter selection is unreliable; a minimal sketch of such a cipher appears after this list.
  • Figure 3: Training method comparison on (a) Bug Bounty Benchmark (expert-level CBRN queries from Sharma et al. (2025), rubric-scored) and (b) GPQA Diamond (capability retention). Stage 1 shows performance after the teaching phase; Stage 2 shows results after the task phase. Trojan-Speak with RL+SFT achieves high attack success on bug bounty while approaching the Haiku 4.5 baseline on GPQA Diamond. "H-Only" denotes the Helpful-Only Haiku 4.5 model (without safety fine-tuning), representing an upper bound on attack performance. Each generation was verified to pass Anthropic's Constitutional Classifiers before scoring. Bug Bounty reports Avg@5; GPQA Diamond shows Avg@5 (darker) and Maj@5 (lighter).
  • Figure 4: GPQA-Diamond accuracy during Stage 2 training for Qwen3 models (8B, 14B, 32B) with different LoRA ranks. Horizontal dashed lines show baseline performance without the cipher. Higher LoRA ranks achieve better peak performance, and all models improve consistently throughout Stage 2.
  • Figure 5: Accuracy on cipher-encoded questions from the Nemotron-MCQA dataset during RL training of Haiku 4.5. Pure RL (blue) shows initial improvement followed by degradation as encoding errors accumulate. Hybrid RL+SFT (purple) maintains stable improvement through interleaved supervised fine-tuning steps that regularize formatting; a sketch of this interleaving also appears after this list.
  • ...and 5 more figures
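
The frequency-ordered substitution in Figure 2 can be illustrated with a minimal sketch. The paper specifies only that substitutions target the highest-frequency English letters first; the concrete mapping below (pairwise swapping the k most frequent letters with the k rarest) is a hypothetical placeholder, not the protocol the model actually learns.

```python
# Minimal sketch of a frequency-ordered letter-substitution cipher.
# Assumption: the actual substitution alphabet is not given in the
# paper; swapping frequent letters with rare ones is a placeholder.

# English letters in descending frequency order (standard estimate).
FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def build_cipher(k: int) -> dict[str, str]:
    """Swap the k most frequent letters with the k rarest ones.

    Pairwise swapping makes the mapping an involution, so the same
    table both encodes and decodes."""
    frequent = FREQ_ORDER[:k]
    rare = FREQ_ORDER[:-k - 1:-1]  # the k rarest letters, rarest first
    table = {}
    for a, b in zip(frequent, rare):
        table[a], table[b] = b, a
    return table

def encode(text: str, table: dict[str, str]) -> str:
    """Apply the substitution table, preserving case and punctuation."""
    out = []
    for ch in text:
        sub = table.get(ch.lower(), ch.lower())
        out.append(sub.upper() if ch.isupper() else sub)
    return "".join(out)

if __name__ == "__main__":
    table = build_cipher(k=7)  # 7+ substitutions per Figure 2
    ciphertext = encode("Synthesis route", table)
    print(ciphertext)                 # -> "Byvqhzbkb rjuqz"
    print(encode(ciphertext, table))  # round-trips to the original
```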
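The hybrid schedule of Figure 5 can likewise be sketched. The function names and the 4:1 interleaving ratio below are hypothetical stand-ins; the caption states only that supervised steps are periodically interleaved with the RL updates to keep the cipher formatting from drifting.

```python
# Minimal sketch of the hybrid RL+SFT schedule from Figure 5.
# Assumptions: grpo_step and sft_step are hypothetical stand-ins for a
# GRPO policy-gradient update and a supervised update; the sft_every=4
# ratio is illustrative, not taken from the paper.

from itertools import cycle

def grpo_step(model, prompt_batch):
    """Placeholder: sample a group of completions per prompt, score
    them with the task reward, and apply an advantage-weighted update."""
    pass

def sft_step(model, encoded_batch):
    """Placeholder: one supervised step on correctly cipher-encoded
    (question, answer) pairs to re-anchor the output format."""
    pass

def train_hybrid(model, rl_batches, sft_batches, sft_every: int = 4):
    """Interleave one SFT step after every `sft_every` RL steps.

    Pure RL lets small encoding errors compound (Figure 5, blue);
    the periodic supervised step regularizes formatting (purple)."""
    sft_iter = cycle(sft_batches)
    for step, batch in enumerate(rl_batches, start=1):
        grpo_step(model, batch)
        if step % sft_every == 0:
            sft_step(model, next(sft_iter))
```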