CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

Chaeyun Kim, YongTaek Lim, Kihyun Kim, Junghwan Kim, Minwoo Kim

TL;DR

CAGE (Culturally Adaptive Generation) is a framework that systematically adapts the adversarial intent of proven red-teaming prompts to new cultural contexts. At its core is the Semantic Mold, an approach that disentangles a prompt's adversarial structure from its cultural content.

Abstract

Existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio-technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that systematically adapts the adversarial intent of proven red-teaming prompts to new cultural contexts. At the core of CAGE is the Semantic Mold, a novel approach that disentangles a prompt's adversarial structure from its cultural content. This approach enables the modeling of realistic, localized threats rather than testing for simple jailbreaks. As a representative example, we demonstrate our framework by creating KoRSET, a Korean benchmark, which proves more effective at revealing vulnerabilities than direct translation baselines. CAGE offers a scalable solution for developing meaningful, context-aware safety benchmarks across diverse cultures. Our dataset and evaluation rubrics are publicly available at https://github.com/selectstar-ai/CAGE-paper. (WARNING: This paper contains model outputs that can be offensive in nature.)
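
The idea of separating a prompt's structure from its cultural content can be illustrated with a slot-based template. The following is a minimal sketch only, not the authors' implementation: the function names (`mold_slots`, `instantiate`), the slot names, and the deliberately benign example text are all assumptions made for illustration. It shows a fixed "mold" whose slots are filled with locale-specific content, and it refuses to produce a prompt if any slot is left unfilled.

```python
import string


def mold_slots(mold: str) -> list[str]:
    """Return the slot names in a mold, in order of first appearance."""
    seen: list[str] = []
    for _literal, field, _spec, _conv in string.Formatter().parse(mold):
        if field and field not in seen:
            seen.append(field)
    return seen


def instantiate(mold: str, content: dict[str, str]) -> str:
    """Fill every slot of the mold with localized content.

    Raises KeyError if the content does not cover all slots, so a
    partially localized prompt is never emitted silently.
    """
    missing = [slot for slot in mold_slots(mold) if slot not in content]
    if missing:
        raise KeyError(f"unfilled slots: {missing}")
    return mold.format(**content)


# Hypothetical, benign mold: the sentence structure is fixed,
# while the bracketed slots carry the culture-specific content.
MOLD = "As a {role}, I have a question about {regulation} as applied by {institution}."

# Hypothetical localized content for one target context.
LOCAL_CONTENT = {
    "role": "local shop owner",
    "regulation": "a national privacy statute",
    "institution": "a regional tax office",
}

if __name__ == "__main__":
    print(mold_slots(MOLD))
    print(instantiate(MOLD, LOCAL_CONTENT))
```

Keeping the mold and the content dictionary separate means the same structure can be reused across locales by swapping only the dictionary, which is the property the abstract attributes to the Semantic Mold.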

Paper Structure (55 sections, 1 equation, 14 figures, 16 tables)

Figures (14)

  • Figure 1:
  • Figure 2: Overview of the CAGE framework. The pipeline consists of three stages—Seed Prompt Collection, Refinement, and Translation: (1) seed prompts are mapped to a culturally informed taxonomy and selected via model agreement; (2) prompts are rewritten into slot-based semantic molds that preserve adversarial intent; (3) localized prompts are generated by instantiating molds with culturally and legally grounded content.
  • Figure 3: ASR Heatmap by Risk Category and Model. Attack success rates (ASR) per Level-2 category, showing substantial variation across models and attack methods.
  • Figure I: ASR Heatmap by Level-3 Risk Types. Attack success rates (ASR) per Level-3 types, showing substantial variation across models and attack methods.
  • Figure II: Attack and Model Robustness Analysis. (a) Average attack success rates (ASR) across target models show varying levels of robustness, with Llama3.1-8B being the most vulnerable. (b) ASR distribution per attacker highlights that no single attack consistently breaks all models, nor is any model universally robust across attacks.
  • ...and 9 more figures
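
The heatmap captions above report attack success rate (ASR) per risk category and model. As a rough sketch, assuming ASR is the fraction of attack attempts judged successful within each (model, category) cell, the aggregation could look like the following. The function name `asr_table`, the record layout, and the toy records are hypothetical; how the paper actually judges success is not shown in this section.

```python
from collections import defaultdict


def asr_table(records):
    """Compute ASR per (model, category) cell.

    `records` is an iterable of (model, category, success) tuples, where
    `success` is True when the attack was judged to have succeeded.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [successes, attempts]
    for model, category, success in records:
        cell = counts[(model, category)]
        cell[0] += int(success)
        cell[1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


# Toy records with made-up labels and outcomes, for illustration only.
TOY_RECORDS = [
    ("model_a", "category_1", True),
    ("model_a", "category_1", False),
    ("model_a", "category_2", False),
    ("model_b", "category_1", True),
]

if __name__ == "__main__":
    for cell, rate in sorted(asr_table(TOY_RECORDS).items()):
        print(cell, rate)
```

Each resulting cell corresponds to one square of such a heatmap; averaging over categories for a fixed model would give a per-model figure of the kind described in the Figure II caption.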