Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension

Amir Hossein Yari, Fajri Koto

TL;DR

CAPTex addresses the gap where multilingual LLMs struggle to comprehend culturally contextual procedural texts. The authors construct CAPTex with seven languages across ten cultural domains and four task types to evaluate procedural reasoning in a zero-shot setting, using metrics such as $\rho$, $\tau$, and LD across tasks, with ROUGE-L, BERTScore, and semantic similarity for generation. Key findings show language- and domain-dependent performance gaps, especially in low-resource languages, while task format impacts reasoning (CB-MCQ and PB-MCQ outperform direct prompts) and longer procedures affect ordering only modestly in terms of rank correlations. The work highlights cultural biases in training data and provides a rigorous benchmark to drive culturally aware improvements in multilingual procedural understanding, with implications for fair and effective cross-cultural AI deployment.
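
To make the ordering metrics concrete, here is a minimal sketch of how a predicted step order could be scored against a gold order. It assumes that the two rank-correlation symbols denote Spearman's and Kendall's coefficients and that LD stands for Levenshtein distance; the summary does not spell these out, and the function names below are hypothetical, not taken from the CAPTex code.

```python
from itertools import combinations


def spearman_rho(gold, pred):
    """Spearman rank correlation for two orderings of the same steps (no ties)."""
    n = len(gold)
    if n < 2:
        raise ValueError("need at least two steps")
    pos = {step: i for i, step in enumerate(pred)}
    # Sum of squared differences between gold and predicted positions.
    d2 = sum((i - pos[step]) ** 2 for i, step in enumerate(gold))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def kendall_tau(gold, pred):
    """Kendall rank correlation: (concordant - discordant) / total pairs."""
    pos = {step: i for i, step in enumerate(pred)}
    ranks = [pos[step] for step in gold]
    concordant = discordant = 0
    for a, b in combinations(ranks, 2):
        if a < b:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    gold = [1, 2, 3, 4]   # hypothetical gold step order
    pred = [1, 3, 2, 4]   # hypothetical model output with two steps swapped
    print(spearman_rho(gold, pred), kendall_tau(gold, pred), levenshtein(gold, pred))
```

Under these assumptions, a perfect ordering gives both correlations a value of 1 and an edit distance of 0, while a fully reversed ordering gives both correlations a value of -1.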

Abstract

Despite the impressive performance of multilingual large language models (mLLMs) in various natural language processing tasks, their ability to understand procedural texts, particularly those with culture-specific content, remains largely unexplored. Texts describing cultural procedures, including rituals, traditional craftsmanship, and social etiquette, require an inherent understanding of cultural context, presenting a significant challenge for mLLMs. In this work, we introduce CAPTex, a benchmark designed to evaluate mLLMs' ability to process and reason about culturally diverse procedural texts across multiple languages using various methodologies to assess their performance. Our findings indicate that (1) mLLMs face difficulties with culturally contextualized procedural texts, showing notable performance declines in low-resource languages, (2) model performance fluctuates across cultural domains, with some areas presenting greater difficulties, and (3) language models exhibit better performance on multiple-choice tasks within conversational frameworks compared to direct questioning. These results underscore the current limitations of mLLMs in handling culturally nuanced procedural texts and highlight the need for culturally aware benchmarks like CAPTex to enhance their adaptability and comprehension across diverse linguistic and cultural landscapes.

Paper Structure

This paper contains 28 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: End-to-End process of dataset creation
  • Figure 2: Procedures step counts by country
  • Figure 3: Language impact on Qwen2.5-14B-Instruct performance
  • Figure 4: Cultural dimension performance by country
  • Figure 5: Impact of step count on reordering
  • ...and 3 more figures