Bangla Grammatical Error Detection Leveraging Transformer-based Token Classification

Shayekh Bin Islam, Ridwanul Hasan Tanvir, Sihat Afnan

TL;DR

This work addresses Bangla grammatical error detection: identifying the substrings of a Bangla text that contain grammatical, punctuation, or spelling errors. The task is a key step toward an automated Bangla typing assistant.

Abstract

Bangla is the seventh most spoken language in the world by total number of speakers, yet automated grammar checking for it remains understudied. Bangla grammatical error detection is the task of identifying the substrings of a Bangla text that contain grammatical, punctuation, or spelling errors, and it is crucial for developing an automated Bangla typing assistant. Our approach frames the task as a token classification problem and uses state-of-the-art transformer-based models. We then combine the outputs of these models and apply rule-based post-processing to produce more reliable and comprehensive results. Our system is evaluated on a dataset of over 25,000 texts from various sources. Our best model achieves a Levenshtein distance score of 1.04. Finally, we provide a detailed analysis of the different components of our system.
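
To make the token classification framing and the Levenshtein-based score concrete, the following is a minimal sketch, not the authors' implementation. It assumes binary per-token labels (1 = token belongs to an error, 0 = clean) and a character-level edit distance between a predicted and a reference string; the abstract does not specify the label scheme or how the score is aggregated over texts. The function names `error_spans` and `levenshtein` are hypothetical.

```python
from typing import List, Tuple


def error_spans(labels: List[int]) -> List[Tuple[int, int]]:
    """Merge consecutive tokens labeled 1 into half-open (start, end) spans.

    Assumption: labels are binary per-token predictions, 1 meaning "error".
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a new error span begins here
        elif label != 1 and start is not None:
            spans.append((start, i))  # the span ended at the previous token
            start = None
    if start is not None:
        spans.append((start, len(labels)))  # span runs to the end of the text
    return spans


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Under these assumptions, a lower distance means the predicted annotation is closer to the reference, so a score near zero would indicate near-exact agreement.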

Paper Structure

This paper contains 31 sections, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: Transformer-based Token Classification Models
  • Figure 2: LSTM-CRF Token Classification Models
  • Figure 3: Effect of confidence thresholding on the dev set
  • Figure 4: Separate head for missing errors
  • Figure 5: