PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems

Nima Sedghiyeh, Sara Sadeghi, Reza Khodadadi, Farzin Kashani, Omid Aghdaei, Somayeh Rahimi, Mohammad Sadegh Safari

TL;DR

PSRB tackles the challenge of evaluating Persian ASR in low-resource settings by introducing a comprehensive benchmark that spans diverse linguistic, demographic, and acoustic conditions, and by using it to evaluate ten ASR systems. It combines a curated data collection framework with a novel Substitution Weighted WER (SW-WER) metric to provide a more robust assessment than traditional WER/CER alone. Through extensive error analysis, the study identifies word-boundary, formality, He-Kasreh, interpretation, and hallucination errors, and shows substantial performance gaps for children, regional accents, and spontaneous speech. The findings highlight the need for diverse, representative training data and targeted model adaptations, and position PSRB as a resource for Persian ASR research and as a template for benchmarking other low-resource languages.

Abstract

Although Automatic Speech Recognition (ASR) systems have become an integral part of modern technology, their evaluation remains challenging, particularly for low-resource languages such as Persian. This paper introduces the Persian Speech Recognition Benchmark (PSRB), a comprehensive benchmark designed to address this gap by incorporating diverse linguistic and acoustic conditions. We evaluate ten ASR systems, including state-of-the-art commercial and open-source models, to examine performance variations and inherent biases. Additionally, we conduct an in-depth analysis of Persian ASR transcriptions, identifying key error types and proposing a novel metric that weights substitution errors. This metric makes evaluation more robust by reducing the impact of minor and partial errors, thereby improving the precision of performance assessment. Our findings indicate that while ASR models generally perform well on standard Persian, they struggle with regional accents, children's speech, and specific linguistic challenges. These results highlight the need for fine-tuning and for diverse, representative training datasets to mitigate biases and improve overall ASR performance. PSRB provides a valuable resource for advancing ASR research in Persian and serves as a framework for developing benchmarks in other low-resource languages. A subset of the PSRB dataset is publicly available at https://huggingface.co/datasets/PartAI/PSRB.
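The abstract describes the proposed metric only at a high level: substitution errors are weighted so that minor and partial errors count less than complete word mismatches. The sketch below illustrates that general idea and should not be read as the paper's SW-WER definition. It assumes, purely for illustration, that a substitution costs the character-level edit distance between the two words divided by the length of the longer word, while insertions and deletions keep a cost of 1. The function names `char_edit_distance`, `substitution_weight`, and `weighted_wer` are hypothetical.

```python
def char_edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character of a
                curr[j - 1] + 1,           # insert a character of b
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def substitution_weight(ref_word: str, hyp_word: str) -> float:
    """Assumed weight in [0, 1]: normalized character edit distance."""
    if ref_word == hyp_word:
        return 0.0
    longest = max(len(ref_word), len(hyp_word))
    return char_edit_distance(ref_word, hyp_word) / longest


def weighted_wer(reference: str, hypothesis: str, weighted: bool = True) -> float:
    """Word error rate with an optional soft cost for substitutions.

    With weighted=False this reduces to the standard WER.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance with a real-valued substitution cost.
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, ref_word in enumerate(ref, start=1):
        curr = [float(i)]
        for j, hyp_word in enumerate(hyp, start=1):
            if ref_word == hyp_word:
                sub_cost = 0.0
            elif weighted:
                sub_cost = substitution_weight(ref_word, hyp_word)
            else:
                sub_cost = 1.0
            curr.append(min(
                prev[j] + 1.0,           # deletion of a reference word
                curr[j - 1] + 1.0,       # insertion of a hypothesis word
                prev[j - 1] + sub_cost,  # match or (weighted) substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref, hyp = "the cat sat", "the cat sit"
    print("WER:         ", weighted_wer(ref, hyp, weighted=False))
    print("weighted WER:", weighted_wer(ref, hyp, weighted=True))
```

Under this assumed weighting, a one-character slip in a three-letter word contributes a third of a full word error, whereas standard WER counts it as a whole error. Insertions and deletions are scored identically in both modes.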

Paper Structure

This paper contains 26 sections, 5 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overall structure of the PSRB benchmark. Diagram illustrating the multidimensional structure of the PSRB benchmark, encompassing age, gender, accent, data source, spontaneity, semantic content, acoustic environment, and formality to evaluate Persian ASR systems comprehensively.
  • Figure 2: Histogram of benchmark audio durations.
  • Figure 3: Scatter plots demonstrating the correlation between WER, CER, and the proposed SW-WER metric.
  • Figure 4: Bar plots showing SW-WER across age groups (children, teens, adults, seniors), spontaneity levels (read (R) vs. spontaneous (S) speech), and accent categories (standard vs. accented) for nine ASR models.
  • Figure 5: Box plots on a logarithmic scale depicting SW-WER variations across gender, spontaneity, and accent for nine ASR models.
  • ...and 2 more figures