PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

Mohammad Javad Ranjbar Kalahroodi, Mohammad Amini, Parmis Bathayan, Heshaam Faili, Azadeh Shakery

Abstract

Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching, none of which are captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform their audio counterparts, suggesting that models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn (poetic meter) detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench

Paper Structure (22 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Annotation example from PARSA-Bench. The poem is labeled with its vazn (metrical pattern). Because short vowels are omitted in Persian script, vazn cannot be identified from text alone; audio recitation is the primary carrier of metrical information.
  • Figure 2: Overview of PARSA-Bench tasks across three evaluation dimensions: Speech Understanding, Paralinguistic Analysis, and Persian Cultural Audio Understanding.
  • Figure 3: Zero-shot audio performance (F1) of all evaluated LALMs across PARSA-Bench tasks.
  • Figure 4: Average score per evaluation dimension for open-source models, ordered by parameter count. Each bar is the macro-average of task-level scores within that dimension (F1 or Accuracy for classification; COMET for translation). Scale improves performance consistently within the Qwen family, but Gemma-E4B (4B) outperforms Qwen2.5-7B on Cultural Audio despite fewer parameters, showing that training data coverage matters more than model size for culturally-grounded tasks.
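
The macro-average described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `dimension_averages`, the task names, and all score values are hypothetical placeholders, and the sketch assumes every task-level score (F1, Accuracy, or COMET) is already expressed on a common 0-1 scale.

```python
from collections import defaultdict
from statistics import mean


def dimension_averages(task_scores, task_dimension):
    """Return the unweighted (macro) mean of task-level scores per dimension.

    task_scores:    maps task name -> task-level score (assumed 0-1 scale)
    task_dimension: maps task name -> evaluation dimension
    """
    grouped = defaultdict(list)
    for task, score in task_scores.items():
        grouped[task_dimension[task]].append(score)
    # Every task counts equally, regardless of how many samples it contains.
    return {dim: mean(scores) for dim, scores in grouped.items()}


# Hypothetical placeholder values, NOT results reported in the paper.
task_scores = {"task_a": 0.80, "task_b": 0.60, "task_c": 0.30, "task_d": 0.50}
task_dimension = {
    "task_a": "Speech Understanding",
    "task_b": "Speech Understanding",
    "task_c": "Paralinguistic Analysis",
    "task_d": "Persian Cultural Audio Understanding",
}

averages = dimension_averages(task_scores, task_dimension)
for dimension, value in averages.items():
    print(f"{dimension}: {value:.2f}")
```

Because the mean is taken over tasks rather than over samples, a small task influences its dimension's bar as much as a large one does, which is the usual meaning of "macro-average".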