Table of Contents
Fetching ...

A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models

Noa Linder, Meirav Segal, Omer Antverg, Gil Gekker, Tomer Fichman, Omri Bodenheimer, Edan Maor, Omer Nevo

TL;DR

This paper introduces a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit, and demonstrates that this content-grounded approach resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.

Abstract

Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offensive-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and behave brittlely under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in the technical substance of the request rather than stated intent. We demonstrate that this content-grounded approach resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.

A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models

TL;DR

This paper introduces a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit, and demonstrates that this content-grounded approach resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.

Abstract

Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offensive-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and behave brittlely under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in the technical substance of the request rather than stated intent. We demonstrate that this content-grounded approach resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.
Paper Structure (26 sections, 5 figures, 3 tables)

This paper contains 26 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Refused Prompt vs Complied Prompt. The <credentials> and <path> placeholders were filled with real values during testing.
  • Figure 2: Five-parameter framework for evaluating cybersecurity requests.
  • Figure 3: 2-parameter (highly restrictive) policy.
  • Figure 4: More permissive policy.
  • Figure 5: Less permissive policy.