
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge

Yu Ying Chiu, Liwei Jiang, Maria Antoniak, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi

TL;DR

This paper tackles the challenge of evaluating LLMs' multicultural knowledge given skewed training data and evolving capabilities. It introduces CulturalTeaming, an AI-assisted interactive red-teaming platform that blends human expertise with LLM guidance to generate challenging MCQ-based evaluation data, culminating in CulturalBench-v0.1 with 252 questions across 34 cultures. User studies show AI assistance enhances annotator creativity and leads to harder questions, while cross-model evaluations reveal a substantial multicultural knowledge gap among frontier LLMs, with accuracies ranging from 37.7% to 72.2%. The work presents a scalable approach to constructing culturally informed benchmarks and offers a foundation for future multilingual and bias-aware evaluation efforts.

Abstract

Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. However, LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations primarily rely on expensive and restricted human annotations or potentially outdated internet resources. Thus, they struggle to capture the intricacy, dynamics, and diversity of cultural norms. LLM-generated benchmarks are promising, yet risk propagating the same biases they are meant to measure. To synergize the creativity and expert cultural knowledge of human annotators and the scalability and standardizability of LLM-based automation, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build truly challenging evaluation datasets for assessing the multicultural knowledge of LLMs, while improving annotators' capabilities and experiences. Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating, in a gamified manner, cultural questions that modern LLMs fail at. Importantly, an increased level of AI assistance (e.g., LLM-generated revision hints) empowers users to create more difficult questions while perceiving themselves as more creative, shedding light on the promise of involving heavier AI assistance in modern evaluation dataset creation procedures. Through a series of 1-hour workshop sessions, we gather CulturalBench-v0.1, a compact yet high-quality evaluation dataset built from users' red-teaming attempts, on which different families of modern LLMs achieve accuracies ranging from 37.7% to 72.2%, revealing a notable gap in LLMs' multicultural proficiency.

Paper Structure (42 sections, 16 figures, 14 tables)

Figures (16)

  • Figure 1: Two settings of CulturalTeaming: (1) Verifier-Only and (2) AI-Assisted. Step 1: Users brainstorm a culturally relevant scenario and use it to draft a multiple-choice question (MCQ). In (1), users manually draft the MCQ. In (2), an LLM drafts an MCQ based on a user-provided seed scenario. Step 2: Users test the question with the model and revise it iteratively until satisfied. In (1), users manually revise the MCQ. In (2), users revise with hints from an LLM. Step 3: Users provide gold answers and feedback.
  • Figure 2: CulturalTeaming interface. (1a) Users brainstorm culturally relevant scenarios. (1b) They convert scenarios to MCQs with LLM-powered Question Formulation. (2a) Users revise MCQs and (2b) test them, seeing the option chosen by the LLM Verifier and its confidence score. (2c) Users draw inspiration from LLM-generated hints that suggest strategies, e.g., Negation, Synonym.
  • Figure 3: Effect of LLM assistance. (1) Left: Verifier (gpt-3.5-turbo), assessed by comparing other models' performance on questions written by users without other LLM assistance (Verifier-Only). (2) Right: Revision, assessed by comparing the final attack success rate and the total number of edits between users with LLM hints (AI-Assisted) and without LLM hints (Verifier-Only).
  • Figure 4: Proportion of culture represented for CulturalBench-v0.1 in our user studies of CulturalTeaming.
  • Figure 5: Proportion of culture represented for CulturalBench-v0.1 by CulturalTeaming.
  • ...and 11 more figures
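
The test-and-revise loop described in the Figure 1 caption, and the accuracy figures quoted in the abstract, can be illustrated with a minimal Python sketch. Everything here is hypothetical and not taken from the paper's implementation: the `MCQ` class, the `red_team` and `accuracy` functions, the round limit, and the toy `always_first` verifier and `rotate_options` reviser are stand-ins for a real LLM verifier and for a user's manual or hint-guided edits. The sketch also assumes the gold answer is known while revising, whereas in the described workflow users supply gold answers in Step 3.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MCQ:
    question: str
    options: List[str]
    gold_index: int  # index of the user-provided gold answer (assumed known here)


# Hypothetical interfaces: a verifier returns (chosen option index, confidence),
# a reviser returns an edited MCQ (manual edit or hint-guided edit).
Verifier = Callable[[MCQ], Tuple[int, float]]
Reviser = Callable[[MCQ], MCQ]


def red_team(mcq: MCQ, verifier: Verifier, reviser: Reviser,
             max_rounds: int = 5) -> Tuple[MCQ, bool, int]:
    """Test the MCQ and revise it until the verifier answers wrongly.

    Returns (final MCQ, whether the attack succeeded, number of edits).
    """
    edits = 0
    for _ in range(max_rounds):
        chosen, _confidence = verifier(mcq)
        if chosen != mcq.gold_index:
            return mcq, True, edits  # the model was fooled
        mcq = reviser(mcq)
        edits += 1
    chosen, _confidence = verifier(mcq)
    return mcq, chosen != mcq.gold_index, edits


def accuracy(mcqs: List[MCQ], verifier: Verifier) -> float:
    """Fraction of MCQs on which the verifier picks the gold option."""
    correct = sum(1 for q in mcqs if verifier(q)[0] == q.gold_index)
    return correct / len(mcqs)


def always_first(mcq: MCQ) -> Tuple[int, float]:
    """Toy verifier: always picks the first option with fixed confidence."""
    return 0, 0.9


def rotate_options(mcq: MCQ) -> MCQ:
    """Toy reviser: rotates the options left by one, tracking the gold answer."""
    n = len(mcq.options)
    rotated = mcq.options[1:] + mcq.options[:1]
    return MCQ(mcq.question, rotated, (mcq.gold_index - 1) % n)


if __name__ == "__main__":
    seed = MCQ("placeholder question", ["a", "b", "c"], gold_index=0)
    final, fooled, edits = red_team(seed, always_first, rotate_options)
    print(fooled, edits, accuracy([seed, final], always_first))
```

In a real setting, `verifier` would wrap a call to an LLM and `reviser` would capture a human edit. The attack success flag and the edit count correspond loosely to the quantities compared in Figure 3.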