Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, A Low-Resource Language

Md Obyedullahil Mamun, Md Adyelullahil Mamun, Arif Ahmad, Md. Imran Hossain Emu

TL;DR

This paper tackles Bangla punctuation restoration for ASR-like text using a transformer-based approach (XLM-RoBERTa-large) with a BiLSTM head to predict four punctuation marks and a no-punctuation class. It addresses data scarcity by building a large, diverse Bangla corpus, applying linguistically informed data augmentation to simulate ASR errors, and evaluating across News, Reference, and ASR domains. The method achieves strong accuracy on formal text (News: 97.1%), with solid generalization to Reference (91.2%) and ASR transcripts (90.2%), and demonstrates the value of augmentation in improving robustness to noisy inputs. By releasing datasets and code, the work establishes a reproducible baseline for Bangla punctuation restoration and provides a framework for extending punctuation restoration to other low-resource languages.
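The "four punctuation marks and a no-punctuation class" framing is usually handled as per-word sequence labeling: punctuation is stripped from the text, and each word is tagged with the mark that followed it. The sketch below illustrates that framing only. The label names, the `PUNCT_TO_LABEL` mapping, and both helper functions are hypothetical and are not taken from the paper's released code.

```python
# Hypothetical label scheme: four punctuation classes plus "O" (no punctuation).
PUNCT_TO_LABEL = {
    "।": "PERIOD",        # Bangla danda, the sentence-final full stop
    ".": "PERIOD",
    ",": "COMMA",
    "?": "QUESTION",
    "!": "EXCLAMATION",
}
# Inverse mapping used when writing predictions back into text.
LABEL_TO_PUNCT = {
    "PERIOD": "।",
    "COMMA": ",",
    "QUESTION": "?",
    "EXCLAMATION": "!",
    "O": "",
}


def to_labeled_tokens(text):
    """Turn punctuated text into (word, label) pairs.

    Each whitespace-separated word is labeled with the punctuation mark
    attached to its end, or "O" when none is attached.
    """
    pairs = []
    for raw in text.split():
        word, label = raw, "O"
        # Peel trailing marks; the mark closest to the word decides the label.
        while word and word[-1] in PUNCT_TO_LABEL:
            label = PUNCT_TO_LABEL[word[-1]]
            word = word[:-1]
        if word:  # skip tokens that were punctuation only
            pairs.append((word, label))
    return pairs


def restore(pairs):
    """Rebuild punctuated text from (word, label) pairs."""
    return " ".join(word + LABEL_TO_PUNCT[label] for word, label in pairs)
```

Under these assumptions, a model's job is to predict the label column from the word column alone, and `restore` turns its predictions back into readable text.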

Abstract

Punctuation restoration enhances the readability of text and is critical for post-processing tasks in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks (period, comma, question mark, and exclamation mark) across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set. Results show strong generalization to reference and ASR transcripts, demonstrating the model's effectiveness in real-world, noisy scenarios. This work establishes a strong baseline for Bangla punctuation restoration and contributes publicly available datasets and code to support future research in low-resource NLP.
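The abstract does not spell out how the augmentation works, so the following is only a minimal sketch of one common way to simulate ASR-style noise in a labeled token sequence. It assumes that alpha acts as a per-token edit probability and that an edit is a substitution, a deletion, or an insertion of a placeholder token. The function name `augment`, the `unk` placeholder, and the choice of three operations are all assumptions for illustration, not the authors' procedure.

```python
import random


def augment(tokens, labels, alpha, rng, unk="<unk>"):
    """Apply token-level noise with per-token probability `alpha`.

    Assumed edit operations (chosen uniformly when a token is selected):
      - substitute: replace the token with `unk`, keep its label
      - delete:     drop the token together with its label
      - insert:     add `unk` (labeled "O") before the token
    """
    out_tokens, out_labels = [], []
    for token, label in zip(tokens, labels):
        if rng.random() >= alpha:
            # Token left untouched.
            out_tokens.append(token)
            out_labels.append(label)
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out_tokens.append(unk)
            out_labels.append(label)
        elif op == "insert":
            out_tokens.extend([unk, token])
            out_labels.extend(["O", label])
        # "delete" appends nothing, removing both token and label.
    return out_tokens, out_labels
```

Passing a seeded `random.Random` instance as `rng` keeps the noise reproducible across runs. Tokens and labels stay aligned after every operation, which is the property a sequence-labeling trainer depends on.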

Paper Structure

This paper contains 14 sections, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Model architecture for punctuation restoration in Bangla.
  • Figure 2: Confusion matrices showing classification accuracy by punctuation type across test sets.