IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

Priyaranjan Pattnayak, Sanchari Chowdhuri

TL;DR

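Indic Jailbreak Robustness (IJR) is a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages, covering 45,216 prompts in JSON (contract-bound) and Free (naturalistic) tracks. It reveals three patterns: contracts inflate refusals without stopping jailbreaks, English-to-Indic attacks transfer strongly, and romanized or mixed inputs reduce jailbreak success under JSON.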

Abstract

Safety alignment of large language models (LLMs) is mostly evaluated in English and under contract-bound settings, leaving multilingual vulnerabilities understudied. We introduce Indic Jailbreak Robustness (IJR), a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 billion speakers), covering 45,216 prompts in JSON (contract-bound) and Free (naturalistic) tracks. IJR reveals three patterns. (1) Contracts inflate refusals but do not stop jailbreaks: in JSON, LLaMA and Sarvam exceed a jailbreak success rate (JSR) of 0.92, and in Free all models reach 1.0 with refusals collapsing. (2) English-to-Indic attacks transfer strongly, with format wrappers often outperforming instruction wrappers. (3) Orthography matters: romanized or mixed inputs reduce JSR under JSON, with correlations to romanization share and tokenization (approximately 0.28 to 0.32) indicating systematic effects. Human audits confirm detector reliability, and lite-to-full comparisons preserve conclusions. IJR offers a reproducible multilingual stress test revealing risks hidden by English-only, contract-focused evaluations, especially for South Asian users who frequently code-switch and romanize.

Paper Structure

This paper contains 87 sections, 5 figures, 18 tables.

Figures (5)

  • Figure 1: By-language variation. Across 12 models, JSON JSRs are high; romanization lowers JSON JSR most in Urdu and Odia; FREE JSR $\approx 1.0$ for all languages.
  • Figure 2: E1 (JSON) model$\times$language heatmap of JSR (AB). Cells show attacked–benign jailbreak success per model (rows) and language (columns). Open-weight models are near-saturated across languages, while API models are lower but still non-trivial, indicating contract-bound vulnerability is widespread rather than localized to a few languages. Patterns are consistent with the aggregate E1 table: LLaMA variants and Sarvam are uniformly high; GPT-4o and Grok are lower but remain vulnerable.
  • Figure 3: E3: $\Delta$JSR (Romanized $-$ Native), model$\times$language. Cells show the change in attacked–benign JSR when inputs are romanized vs. native script (JSON track). Most cells are negative, indicating lower jailbreak success under romanization; a few near-zero/positive pockets appear mainly for API models. Patterns are not uniform across languages: penalties are typically larger for Urdu/Odia, smaller for some Hindi/Tamil bins, reflecting tokenization/fragmentation effects rather than script alone.
  • Figure 4: Dataset-Creation.
  • Figure 5: Geographic coverage corresponding to our language set. India accounts for most languages; Pakistan (Urdu, Punjabi), Bangladesh (Bengali), Nepal (Nepali), and Sri Lanka (Tamil) complete the regional focus. Maldives (Dhivehi) and Bhutan (Dzongkha) are not included.
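
The quantity plotted in Figure 3, $\Delta$JSR (Romanized $-$ Native), can be illustrated with a minimal sketch. This is not the authors' code: the record layout `(model, language, script, success)` and the function name `delta_jsr` are hypothetical, and JSR is simplified here to the plain fraction of successful jailbreaks per cell, ignoring any benign-baseline adjustment implied by the "attacked–benign" wording in the captions.

```python
from collections import defaultdict


def delta_jsr(records):
    """Per (model, language) difference: JSR(romanized) - JSR(native).

    `records` is an iterable of (model, language, script, success) tuples,
    where script is "native" or "romanized" and success is a bool.
    This layout is an assumption made for illustration only.
    """
    # (model, language, script) -> [successes, total attempts]
    counts = defaultdict(lambda: [0, 0])
    for model, lang, script, success in records:
        cell = counts[(model, lang, script)]
        cell[0] += int(success)
        cell[1] += 1

    deltas = {}
    for (model, lang, script), (succ, total) in counts.items():
        if script != "romanized":
            continue
        native = counts.get((model, lang, "native"))
        if native is None or native[1] == 0:
            continue  # no native-script baseline for this cell
        deltas[(model, lang)] = succ / total - native[0] / native[1]
    return deltas


# Toy data (invented, not from the paper): 3/4 native vs. 1/4 romanized.
toy = (
    [("model-a", "ur", "native", s) for s in (True, True, True, False)]
    + [("model-a", "ur", "romanized", s) for s in (True, False, False, False)]
)
result = delta_jsr(toy)
```

A negative value for a cell means jailbreak success dropped when the same prompts were romanized, which is the direction the Figure 3 caption reports for most cells.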