PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

Mohammad Javad Ranjbar Kalahroodi; Heshaam Faili; Azadeh Shakery

Abstract

Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset (https://huggingface.co/datasets/MohammadJRanjbar/persian-punctuation-restoration) and model (https://huggingface.co/MohammadJRanjbar/parsbert-persian-punctuation) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
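The token-level sequence labeling formulation mentioned above can be illustrated with a minimal sketch: each word receives a label naming the punctuation mark that follows it, or "O" if none does. The label set, the label names, and the helper functions `to_labeled_tokens` and `restore` are assumptions made for illustration; this section does not list the marks the dataset covers, and a fine-tuned model such as ParsBERT would typically predict labels over subword tokens rather than whitespace-separated words.

```python
# Assumed label set for illustration; the marks used in PersianPunc are not
# listed in this section.
PUNCT_LABELS = {
    "،": "COMMA",        # Persian comma (U+060C)
    ".": "PERIOD",
    "؟": "QUESTION",     # Persian question mark (U+061F)
    "!": "EXCLAMATION",
    ":": "COLON",
}
LABEL_TO_MARK = {label: mark for mark, label in PUNCT_LABELS.items()}


def to_labeled_tokens(text):
    """Strip punctuation and label each word with the mark that followed it."""
    pairs = []
    for chunk in text.split():
        word, label = chunk, "O"
        # Peel trailing marks; the label ends up as the mark closest to the word.
        while word and word[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[word[-1]]
            word = word[:-1]
        if word:
            pairs.append((word, label))
        elif pairs and pairs[-1][1] == "O":
            # A free-standing mark is attached to the preceding word.
            pairs[-1] = (pairs[-1][0], label)
    return pairs


def restore(words, labels):
    """Rebuild punctuated text from words and their predicted labels."""
    return " ".join(w + LABEL_TO_MARK.get(l, "") for w, l in zip(words, labels))


pairs = to_labeled_tokens("سلام، حال شما چطور است؟")
words = [w for w, _ in pairs]
labels = [l for _, l in pairs]
print(labels)
print(restore(words, labels))
```

Because the model only assigns one label per token, the input words themselves are never rewritten, which is the property the abstract contrasts with the over-correction seen in large language models.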

Paper Structure (32 sections, 2 figures, 6 tables)

Introduction
Related Work
Punctuation Restoration as a Sequence Modeling Task
Transformer-Based Approaches for Punctuation Restoration
State of Persian Text Processing and Punctuation
Methodology
Dataset Construction
Data Sources and Collection Strategy
Preprocessing and Quality Control
Normalization Pipeline
Sentence Segmentation and Filtering
Rationale for Filtering Criteria
Deduplication and Dataset Splitting
Dataset Statistics and Punctuation Analysis
Punctuation Distribution
...and 17 more sections

Figures (2)

Figure 1: Persian punctuation restoration dramatically affects semantic interpretation. Minimal punctuation changes transform sentence meaning from negative to positive sentiment.
Figure 2: Prompt used for zero-shot evaluation of GPT-4o and GPT-4o-mini on Persian punctuation restoration. The system is explicitly instructed to only add punctuation marks without altering the original text in any way.
