From Words to Worlds: Benchmarking Cross-Cultural Cultural Understanding in Machine Translation

Bangju Han; Yingqi Wang; Huang Qing; Tiyuan Li; Fengyi Yang; Ahtamjan Ahmat; Abibulla Atawulla; Yating Yang; Xi Zhou

Abstract

Culturally grounded expressions, such as idioms, slang, and culture-specific items (CSIs), are pervasive in natural language and encode meanings that go beyond their literal linguistic form. Translating such expressions accurately remains challenging for machine translation (MT) systems, yet existing benchmarks are fragmented and do not provide a systematic framework for evaluating translation performance on culture-loaded expressions. To address this gap, we introduce CulT-Eval, a benchmark designed to evaluate how models handle different types of culturally grounded expressions. CulT-Eval comprises over 7,959 carefully curated instances spanning multiple expression types, together with a comprehensive taxonomy of culture-related translation errors. Through extensive evaluation of large language models and detailed analysis, we identify recurring, systematic failure modes that existing automatic metrics do not adequately capture. Accordingly, we propose a complementary evaluation metric that targets culturally induced meaning deviations overlooked by standard MT metrics. The results indicate that current models struggle to preserve culturally grounded meaning and to capture the cultural and contextual nuances essential for accurate translation. Our benchmark and code are available at https://anonymous.4open.science/r/CulT-Eval-E75D/.

Paper Structure (31 sections, 2 equations, 11 figures, 8 tables)

Introduction
Related Work
CulT-Eval Benchmark
Data Source
Literary and Narrative Archives.
Public and Institutional Communication.
Cultural Taxonomy
Benchmark Construction Pipeline
LLM-Assisted Candidate Extraction.
Human Annotation and Cultural Term Labeling.
Dataset Statistics and Quality Control
Evaluation and Metric Analysis
Sentence-Level Evaluation with Standard Metrics
Sentence-level Metrics under Cultural Evaluation
A Taxonomy of Culture-related Translation Errors
...and 16 more sections

Figures (11)

Figure 1: Representative CulT-Eval instances.
Figure 2: Overview of the CulT-Eval benchmark.
Figure 3: Performance analysis of six selected models. The left chart displays the overall Cultural Correctness score. The right chart visualizes the distribution of seven specific error types within the incorrect samples.
Figure 4: Evaluation pipeline.
Figure 5: Sensitivity of evaluation metrics to cultural translation errors.
...and 6 more figures
