FastPOS: Language-Agnostic Scalable POS Tagging Framework Low-Resource Use Case
Md Abdullah Al Kafi, Sumit Kumar Banshal
TL;DR
The paper tackles POS tagging for low-resource languages by introducing a language-agnostic, transformer-based framework that adapts across languages with minimal code changes. It demonstrates adaptation from Bangla to Hindi using BanglaBERT with a token-classification head, reaching token-level accuracy of 96.85% in Bangla and 97% in Hindi. The framework emphasizes modularity, open-source availability, and interchangeable backbones, which reduces engineering overhead and frees researchers to concentrate on preprocessing and dataset refinement. The results underscore both the promise of transformer-based approaches for underrepresented languages and the persistent impact of data imbalance and annotation variability on fine-grained POS distinctions.
Abstract
This study proposes a language-agnostic transformer-based POS tagging framework designed for low-resource languages, using Bangla and Hindi as case studies. With only three lines of framework-specific code, the model was adapted from Bangla to Hindi, demonstrating effective portability with minimal modification. The framework achieves token-level accuracy of 96.85 percent in Bangla and 97 percent in Hindi across POS categories, while sustaining strong F1 scores despite dataset imbalance and linguistic overlap. A performance discrepancy in one specific POS category underscores ongoing challenges in dataset curation. The strong results stem from the underlying transformer architecture, which can be replaced with limited code adjustments. The framework's modular, open-source design enables rapid cross-lingual adaptation while reducing model design and tuning overhead. This allows researchers to focus on linguistic preprocessing and dataset refinement, which are essential for advancing NLP in underrepresented languages.
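The abstract reports both token-level accuracy and per-category F1, and notes that one POS category lags despite high overall accuracy. The following is a minimal sketch, not the framework's own evaluation code, of how the two metrics can diverge on an imbalanced tag distribution. The function names (`token_accuracy`, `per_tag_f1`) and the toy tag sequences are assumptions made up for illustration; none of the numbers come from the paper.

```python
from collections import Counter


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def per_tag_f1(gold, pred):
    """Per-tag F1 computed from true positives, false positives, false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p where it was wrong
            fn[g] += 1  # missed an instance of gold tag g
    scores = {}
    for tag in set(gold) | set(pred):
        p_denom = tp[tag] + fp[tag]
        r_denom = tp[tag] + fn[tag]
        precision = tp[tag] / p_denom if p_denom else 0.0
        recall = tp[tag] / r_denom if r_denom else 0.0
        pr_sum = precision + recall
        scores[tag] = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return scores


# Toy, hypothetical data: a frequent tag, a mid-frequency tag, and a rare tag.
gold = ["NOUN"] * 90 + ["VERB"] * 8 + ["INTJ"] * 2
# The tagger mislabels one of the two rare INTJ tokens as NOUN.
pred = ["NOUN"] * 90 + ["VERB"] * 8 + ["NOUN", "INTJ"]

accuracy = token_accuracy(gold, pred)
f1 = per_tag_f1(gold, pred)

print(f"token accuracy: {accuracy:.4f}")
for tag in sorted(f1):
    print(f"F1[{tag}]: {f1[tag]:.4f}")
```

In this toy case a single error on a rare tag barely moves overall accuracy but sharply lowers that tag's F1, which is the kind of category-level discrepancy the abstract attributes to dataset imbalance.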
