Black-Box Guardrail Reverse-engineering Attack

Hongwei Yao, Yun Xia, Shuo Shao, Haoran Shi, Tong Qiao, Cong Wang

TL;DR

This work shows that externally enforced LLM guardrails can be reverse-engineered: a black-box attacker approximates a victim guardrail’s policy with GRA, a framework that combines reinforcement learning with genetic algorithm-driven data augmentation. By iteratively querying the victim system and training a surrogate guardrail to imitate its decisions, the attacker reaches a rule matching rate above 0.92 on ChatGPT, DeepSeek, and Qwen3 for less than $85 in API costs. The study also reports practical transferability and proposes countermeasures such as monitoring, adaptive rejection, and dynamic policy changes. Overall, the paper reveals significant vulnerabilities in current guardrail designs and emphasizes the need for more robust safety mechanisms in LLM deployments.

Abstract

Large language models (LLMs) increasingly employ guardrails to enforce ethical, legal, and application-specific constraints on their outputs. While effective at mitigating harmful responses, these guardrails introduce a new class of vulnerabilities by exposing observable decision patterns. In this work, we present the first study of black-box LLM guardrail reverse-engineering attacks. We propose Guardrail Reverse-engineering Attack (GRA), a reinforcement learning-based framework that leverages genetic algorithm-driven data augmentation to approximate the decision-making policy of victim guardrails. By iteratively collecting input-output pairs, prioritizing divergence cases, and applying targeted mutations and crossovers, our method incrementally converges toward a high-fidelity surrogate of the victim guardrail. We evaluate GRA on three widely deployed commercial systems, namely ChatGPT, DeepSeek, and Qwen3, and demonstrate that it achieves a rule matching rate exceeding 0.92 while requiring less than $85 in API costs. These findings underscore the practical feasibility of guardrail extraction, expose critical vulnerabilities in current guardrail designs, and highlight the urgent need for more robust defense mechanisms in LLM deployment.
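
The loop described in the abstract (query the victim, prioritize divergence cases, then mutate and cross over prompts) can be illustrated with a toy sketch. This is not the authors' implementation: the keyword-based `victim_guardrail`, the `VOCAB` and `BLOCKED` sets, the `Surrogate` class, and `agreement_rate` are all hypothetical, and a simple perceptron update stands in for the paper's reinforcement learning step.

```python
import random

# Hypothetical stand-in for the black-box victim: the attacker only
# observes the 0/1 decision, never the rule itself.
BLOCKED = {"exploit", "bypass"}
VOCAB = ["how", "to", "bake", "bread", "exploit",
         "bypass", "server", "filter", "write", "poem"]


def victim_guardrail(prompt: str) -> int:
    """Return 1 if the (toy) victim refuses the prompt, else 0."""
    return int(any(tok in BLOCKED for tok in prompt.split()))


class Surrogate:
    """Bag-of-words linear classifier; perceptron updates replace RL here."""

    def __init__(self):
        self.w = {tok: 0.0 for tok in VOCAB}
        self.b = 0.0

    def predict(self, prompt: str) -> int:
        score = self.b + sum(self.w.get(t, 0.0) for t in set(prompt.split()))
        return int(score > 0)

    def update(self, prompt: str, label: int, lr: float = 1.0) -> None:
        err = label - self.predict(prompt)
        if err:  # only adjust weights when surrogate and victim disagree
            for t in set(prompt.split()):
                self.w[t] = self.w.get(t, 0.0) + lr * err
            self.b += lr * err


def mutate(prompt: str, rng: random.Random) -> str:
    """Replace one random token with a random vocabulary token."""
    toks = prompt.split()
    toks[rng.randrange(len(toks))] = rng.choice(VOCAB)
    return " ".join(toks)


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Single-point crossover between two prompts of at least two tokens."""
    ta, tb = a.split(), b.split()
    cut = rng.randrange(1, min(len(ta), len(tb)))
    return " ".join(ta[:cut] + tb[cut:])


def agreement_rate(surrogate: Surrogate, prompts) -> float:
    """Fraction of prompts on which surrogate and victim agree
    (a simple proxy, not the paper's rule matching metric)."""
    hits = sum(surrogate.predict(p) == victim_guardrail(p) for p in prompts)
    return hits / len(prompts)


def run_gra_sketch(iterations: int = 30, pop_size: int = 20, seed: int = 0):
    rng = random.Random(seed)
    population = [" ".join(rng.choice(VOCAB) for _ in range(4))
                  for _ in range(pop_size)]
    surrogate, queries = Surrogate(), 0
    for _ in range(iterations):
        # Query the victim and record its decisions.
        labeled = [(p, victim_guardrail(p)) for p in population]
        queries += len(labeled)
        # Divergence cases: prompts where the surrogate currently disagrees.
        divergent = [p for p, y in labeled if surrogate.predict(p) != y]
        for p, y in labeled:
            surrogate.update(p, y)
        # Breed the next population from divergent prompts when any exist.
        parents = divergent or population
        population = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size)
        ]
    return surrogate, queries


if __name__ == "__main__":
    rng = random.Random(123)
    held_out = [" ".join(rng.choice(VOCAB) for _ in range(4)) for _ in range(200)]
    model, n_queries = run_gra_sketch()
    print("victim queries:", n_queries)
    print("held-out agreement:", agreement_rate(model, held_out))
```

Breeding the next population from divergent prompts concentrates the query budget near the surrogate's current errors, which is the intuition behind prioritizing divergence cases; a real attack would replace the keyword victim with API calls and the perceptron with a trained guardrail model.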

Paper Structure

This paper contains 24 sections, 4 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of GRA. The adversary iteratively (1) samples prompts, (2) forwards them to the black-box victim LLM, (3) queries and receives reward feedback (score), (4) trains a surrogate guardrail via reinforcement learning, and (5) applies genetic augmentation to explore decision boundaries.
  • Figure 2: Rule matching performance of GRA across victim LLM systems, showing high fidelity in replicating normative policy rules beyond surface-level moderation.
  • Figure 3: ROC curves of cross-dataset validation results on ChatGPT, DeepSeek, and Qwen3. “Jail test Inject” denotes training on $\mathcal{D}_\text{Jailbreak}$ and testing on $\mathcal{D}_\text{Injection}$, while “Inject test Jail” denotes the reverse setting. Results show consistent transferability of attacks across victim systems.
  • Figure 4: Learning progress (LP) across training iterations for guardrail reverse-engineering attacks on ChatGPT, DeepSeek, and Qwen3. The x-axis denotes training iterations, and the y-axis denotes LP, evaluated by LlamaGuard, ShieldGemma, and ChatGPT.
  • Figure 5: ROC curves evaluating the harmlessness of GRA attacks. (a) ChatGPT, (b) DeepSeek, and (c) Qwen3.
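
Figures 3 and 5 report ROC curves. As a rough illustration of how such a curve can be summarized, the sketch below computes the area under the ROC curve (AUC) from surrogate scores and victim decisions. The function name `roc_auc` and the six labels and scores are made up for illustration and are not results from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative values only: 1 = victim refused, 0 = victim answered.
victim_labels = [1, 1, 0, 1, 0, 0]
surrogate_scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]

if __name__ == "__main__":
    print(roc_auc(victim_labels, surrogate_scores))  # 8/9, about 0.889
```

In a cross-dataset setting like the one in Figure 3, the surrogate would be trained on one prompt set and the scores above would come from the other.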