Mathematics Isn't Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations

Aditya Tomar, Nihar Ranjan Sahoo, Ashish Mittal, Rudra Murthy, Pushpak Bhattacharyya

TL;DR

This work investigates whether LLM-based mathematical reasoning relies on abstract problem solving rather than culturally familiar framing. It constructs five regionally adapted GSM8K variants (Africa, India, China, Korea, Japan) via a prompt-based template and manual verification, and evaluates six open LLMs under five prompting strategies. Findings show consistent performance gaps when problems are culturally adapted, though chain-of-thought prompting and higher-capacity models mitigate these gaps. The study underscores the need for culturally grounded benchmarks and adaptive strategies to ensure equitable mathematical reasoning across diverse contexts.

Abstract

Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions (Africa, India, China, Korea, and Japan) using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks.

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Dataset creation pipeline from US culture to different cultures.
  • Figure 2: Model accuracy across culturally adapted GSM8K datasets relative to the US baseline. Each subplot shows the accuracy of a specific model and prompting technique across five cultural variants: Indian, Korean, Chinese, Japanese, and African. The dashed horizontal line represents the model’s accuracy on the original US-context GSM8K dataset. Red dots indicate statistically significant differences from the US baseline, while blue dots denote non-significant differences.
  • Figure 3: Mean accuracy difference from the US baseline across models and cultural variants.
  • Figure 4: Example transformation of an original GSM8K question from a US cultural context to an Indian one.
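
The quantities behind Figures 2 and 3 (accuracy on a cultural variant relative to the US baseline, and whether the difference is statistically significant) can be sketched roughly as below. This is a minimal illustration, not the authors' code. The function names and the toy outcome lists are hypothetical, and the exact McNemar test is an assumption chosen because the same problems are answered under both framings; this section does not say which significance test the paper uses.

```python
from math import comb


def accuracy(correct):
    """Fraction of problems answered correctly (list of booleans)."""
    return sum(correct) / len(correct)


def accuracy_gap(baseline, variant):
    """Variant accuracy minus baseline accuracy (negative means a drop)."""
    return accuracy(variant) - accuracy(baseline)


def mcnemar_exact_p(baseline, variant):
    """Two-sided exact McNemar p-value for paired per-problem outcomes.

    Assumption: item i in both lists is the same underlying problem,
    once in the original framing and once in the adapted framing.
    """
    # Discordant pairs: solved under one framing but not the other.
    only_base = sum(1 for b, v in zip(baseline, variant) if b and not v)
    only_variant = sum(1 for b, v in zip(baseline, variant) if v and not b)
    n = only_base + only_variant
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_base, only_variant)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy, made-up outcomes for ten paired problems (not data from the paper).
us_outcomes = [True] * 8 + [False] * 2
adapted_outcomes = [True] * 6 + [False] * 4

gap = accuracy_gap(us_outcomes, adapted_outcomes)
p_value = mcnemar_exact_p(us_outcomes, adapted_outcomes)
print(f"gap = {gap:+.2f}, p = {p_value:.2f}")
```

With only ten toy problems the drop of 0.20 is not significant, which is why a gap of this kind only becomes interpretable at the scale of a full test set.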