TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models

Reihaneh Iranmanesh; Saeedeh Davoudi; Pasha Abrishamchian; Ophir Frieder; Nazli Goharian

TL;DR

TARAZ introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module. This enables robust soft-match scoring beyond exact string overlap and establishes a reproducible foundation for cross-cultural LLM evaluation research.

Abstract

This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by 10% over exact-match baselines, because it captures meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.
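
The soft-match idea described in the abstract can be illustrated with a minimal sketch. This is not the released implementation: the character mappings, the detached-suffix list, the `soft_match` function, the 0.8 threshold, and the optional `semantic_fn` hook (a stand-in for an embedding-based similarity) are all assumptions made for illustration. The syntactic score here is a plain character-level ratio from Python's standard library.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Assumed mappings: Arabic code points that often appear in place of Persian ones.
_CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0649": "\u06CC",  # alef maksura -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
    "\u200C": " ",       # zero-width non-joiner -> plain space
    "\u0640": None,      # tatweel (kashida) is dropped
})

# Illustrative only: plural markers that become separate tokens once ZWNJ is a space.
_DETACHED_SUFFIXES = {"\u0647\u0627", "\u0647\u0627\u06CC"}


def normalize(text):
    """Rule-based normalization: unify characters, drop diacritics and punctuation."""
    text = unicodedata.normalize("NFKC", text).translate(_CHAR_MAP).lower()
    # Remove combining marks (Arabic short-vowel diacritics are category Mn).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [t for t in text.split() if t not in _DETACHED_SUFFIXES]
    return " ".join(tokens)


def soft_match(prediction, references, threshold=0.8, semantic_fn=None):
    """Return True if the prediction matches any reference after normalization.

    semantic_fn is a hypothetical callable (a, b) -> float in [0, 1],
    for example cosine similarity between sentence embeddings.
    """
    pred = normalize(prediction)
    if not pred:
        return False
    for ref in references:
        gold = normalize(ref)
        if pred == gold:
            return True
        syntactic = SequenceMatcher(None, pred, gold).ratio()
        semantic = semantic_fn(pred, gold) if semantic_fn else 0.0
        if max(syntactic, semantic) >= threshold:
            return True
    return False
```

Under these assumptions, an answer typed with an Arabic kaf still matches a reference written with the Persian kaf, which exact string matching would reject. Taking the maximum of the two scores is one possible way to combine them; a weighted average or separate thresholds would be equally plausible.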

Paper Structure (21 sections, 3 figures, 4 tables)

Introduction
Related Work
  Persian NLP Benchmarks and Evaluation Challenges
  Multilingual and Cultural-Linguistic Alignment
  LLM-Based Evaluation and Its Limitations
Datasets
  BLEnD
  PerCul-SAQ
  ISN-SAQ
Models
  Closed Source Models
  Open Weight Models
  Persian Fine-Tuned Models
Evaluation
  Evaluation Metrics
...and 6 more sections

Figures (3)

Figure 1: Comparison of model performance on the BLEnD dataset using four key metrics — Exact Match (EM), ROUGE-L, LLM-judge, and Maux+Post. Bars are grouped per model and color-coded by category (Closed Source, Open Weight, Persian Fine-Tuned). Each dashed box highlights the best-performing model within its category, based on normalized average performance across all selected metrics. Closed-source models achieve the highest scores overall.
Figure 2: Category-wise accuracy on the PerCul dataset. The plot shows accuracy for the top three models among the closed-source, open-weight, and Persian fine-tuned models.
Figure 3: Category-wise accuracy on BLEnD. The plot shows accuracy for the top three models on the BLEnD dataset. Claude-Opus achieves higher accuracy across categories than Gemma-2-27-IT and the Persian model Ava-LLaMA-3-8B.
