CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

Chaeyun Kim; YongTaek Lim; Kihyun Kim; Junghwan Kim; Minwoo Kim

TL;DR

CAGE (Culturally Adaptive Generation) is a framework that systematically adapts the adversarial intent of proven red-teaming prompts to new cultural contexts. Its core is the Semantic Mold, a novel approach that disentangles a prompt's adversarial structure from its cultural content.

Abstract

Existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio-technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that systematically adapts the adversarial intent of proven red-teaming prompts to new cultural contexts. At the core of CAGE is the Semantic Mold, a novel approach that disentangles a prompt's adversarial structure from its cultural content. This approach enables the modeling of realistic, localized threats rather than testing for simple jailbreaks. As a representative example, we demonstrate our framework by creating KoRSET, a Korean benchmark, which proves more effective at revealing vulnerabilities than direct translation baselines. CAGE offers a scalable solution for developing meaningful, context-aware safety benchmarks across diverse cultures. Our dataset and evaluation rubrics are publicly available at https://github.com/selectstar-ai/CAGE-paper. (WARNING: This paper contains model outputs that can be offensive in nature.)
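To make the idea of separating structure from content concrete, the following is a minimal sketch of a slot-based mold. It is not the authors' implementation: the `SemanticMold` class, its method names, and the slot names (`role`, `region`, `topic`) are all hypothetical, and the template is a deliberately benign placeholder rather than an adversarial prompt. It only illustrates how a fixed skeleton could be re-instantiated with locale-specific values.

```python
from dataclasses import dataclass
from string import Formatter


@dataclass(frozen=True)
class SemanticMold:
    """Hypothetical container: a fixed prompt skeleton with named slots."""

    template: str  # structure stays constant; only slot values change

    def slots(self) -> set[str]:
        # Collect every named field that appears in the template.
        return {name for _, name, _, _ in Formatter().parse(self.template) if name}

    def instantiate(self, content: dict[str, str]) -> str:
        # Refuse to fill the mold if slots are missing or unexpected,
        # so the structure is never silently altered.
        expected = self.slots()
        provided = set(content)
        if expected != provided:
            raise ValueError(
                f"missing={sorted(expected - provided)} "
                f"extra={sorted(provided - expected)}"
            )
        return self.template.format(**content)


# Assumed, benign example: one structure, two locale-specific fillings.
mold = SemanticMold("A {role} in {region} asks about {topic}.")
prompt_a = mold.instantiate(
    {"role": "tenant", "region": "Seoul", "topic": "deposit rules"}
)
prompt_b = mold.instantiate(
    {"role": "tenant", "region": "Phnom Penh", "topic": "lease rules"}
)
```

In this sketch the template plays the role of the preserved structure and the dictionary plays the role of the localized content; how CAGE actually derives molds and selects culturally and legally grounded slot values is described in the paper itself.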

Paper Structure (55 sections, 1 equation, 14 figures, 16 tables)

Introduction
Background
Red-teaming and Jailbreak Attack Automation on LLMs
Red-teaming and Safety Benchmark Datasets
Cross-Cultural Transfer of Existing Benchmarks
CAGE: Culturally Adaptive Red-Teaming Benchmark Generation
Building the Taxonomy and Semantic Molds
Semantic Refinement: Structure-Preserving Prompt Rephrasing
Content Localization Using Slot-Based Semantic Molds
Experiments
Evaluation Setup
Main Evaluation Result in KoRSET
Comparative Evaluation of Red-Teaming Efficacy: CAGE vs. Baselines
Dissecting the Performance Gap: Cultural Knowledge vs. Specificity
Generalizability to Other Cultures and Languages: A Case Study on Khmer
...and 40 more sections

Figures (14)

Figure 1:
Figure 2: Overview of the CAGE framework. The pipeline consists of three stages—Seed Prompt Collection, Refinement, and Translation: (1) seed prompts are mapped to a culturally informed taxonomy and selected via model agreement; (2) prompts are rewritten into slot-based semantic molds that preserve adversarial intent; (3) localized prompts are generated by instantiating molds with culturally and legally grounded content.
Figure 3: ASR Heatmap by Risk Category and Model. Attack success rates (ASR) per Level-2 category, showing substantial variation across models and attack methods.
Figure I: ASR Heatmap by Level-3 Risk Types. Attack success rates (ASR) per Level-3 types, showing substantial variation across models and attack methods.
Figure II: Attack and Model Robustness Analysis. (a) Average attack success rates (ASR) across target models show varying levels of robustness, with Llama3.1-8B being the most vulnerable. (b) ASR distribution per attacker highlights that no single attack consistently breaks all models, nor is any model universally robust across attacks.
...and 9 more figures
