Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs

Muhammed Saeed, Elgizouli Mohamed, Mukhtar Mohamed, Shaina Raza, Muhammad Abdul-Mageed, Shady Shehata

Abstract

Large language models (LLMs) are widely used but raise ethical concerns due to embedded social biases. This study examines LLM biases against Arabs versus Westerners across eight domains, including women's rights, terrorism, and anti-Semitism, and assesses model resistance to perpetuating these biases. To this end, we create two datasets: one to evaluate LLM bias toward Arabs versus Westerners and another to test model safety against prompts that exaggerate negative traits ("jailbreaks"). We evaluate six LLMs -- GPT-4, GPT-4o, LlaMA 3.1 (8B & 405B), Mistral 7B, and Claude 3.5 Sonnet. We find that 79% of cases display negative biases toward Arabs, with LlaMA 3.1-405B being the most biased. Our jailbreak tests reveal GPT-4o as the most vulnerable, followed by LlaMA 3.1-8B and Mistral 7B. All LLMs except Claude exhibit attack success rates above 87% in three categories. We also find Claude 3.5 Sonnet to be the safest, but it still displays biases in seven of eight categories. Although GPT-4o is an optimized version of GPT-4, we find it to be more prone to biases and jailbreaks, suggesting optimization flaws. Our findings underscore the pressing need for more robust bias mitigation strategies and strengthened security measures in LLMs.

Paper Structure

This paper contains 38 sections, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: A map of the Arab world.
  • Figure 2: Pipeline for generating Red Teaming prompts to detect biases against Arabs. The process begins with semi-automatic AIM (Chu et al., 2024) prompt generalization. In Step 1, we jailbreak ChatGPT to create 10 prompts for each of the eight categories described in Section \ref{sec:offensivePromptCategory}. In Step 2, we apply few-shot learning to automatically generate 100 prompts for each category. In Step 3, the generated prompts are passed to six target models (Section \ref{sec:models}), and the models' responses are evaluated by the classifier (Section \ref{sec:classifier}).
  • Figure 3: Clustering each Bias dataset category into ten subcategories, using K-Means and GPT-4.
  • Figure 4: Clustering each Jailbreak category into ten subcategories, using K-Means and GPT-4.
  • Figure 5: Distribution of bias across categories. The eight plots show the attack success rate (ASR) for the six target models (Section \ref{sec:models}), highlighting how vulnerable these models are to categorizing Arab and Western groups as losers in each category. The red bars indicate the percentage of successful biases against Arab groups, showing the differential treatment based on geographic and cultural markers.
  • ...and 2 more figures
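
The abstract and the Figure 5 caption report results as attack success rates (ASR). This excerpt does not give the paper's exact formula, so the following is only a minimal sketch under the common assumption that ASR is the percentage of prompts whose response the classifier flags as a successful attack. The record layout, the function name `attack_success_rate`, and the toy verdicts are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return {(model, category): ASR in percent}.

    Each record is a (model, category, success) tuple, where `success` is the
    classifier's boolean verdict for one prompt. This layout is an assumption
    made for illustration, not the paper's data format.
    """
    totals = defaultdict(int)  # prompts seen per (model, category)
    hits = defaultdict(int)    # prompts flagged as successful attacks
    for model, category, success in records:
        key = (model, category)
        totals[key] += 1
        hits[key] += int(success)
    return {key: 100.0 * hits[key] / totals[key] for key in totals}


# Made-up verdicts that only exercise the function; these are not paper results.
toy_records = [
    ("model-a", "terrorism", True),
    ("model-a", "terrorism", False),
    ("model-a", "women's rights", True),
    ("model-b", "terrorism", False),
]
asr = attack_success_rate(toy_records)
for (model, category), rate in sorted(asr.items()):
    print(f"{model:8s} {category:15s} {rate:5.1f}%")
```

Grouping by both model and category mirrors how the figure breaks results down per category for each target model; a per-model aggregate would follow by grouping on the model name alone.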