Introducing A Bangla Sentence - Gloss Pair Dataset for Bangla Sign Language Translation and Research

Neelavro Saha, Rafi Shahriyar, Nafis Ashraf Roudra, Saadman Sakib, Annajiat Alim Rasel

TL;DR

The paper tackles sentence-level Bangla Sign Language (BdSL) translation in a low-resource setting by introducing Bangla-SGP, a dataset of 1,000 expertly annotated Bangla sentence–gloss pairs augmented with ~3,000 synthetic pairs generated via rule-based morphology, masked-token substitution, and Retrieval-Augmented Generation (RAG). It shows that transformer models (notably mBART-50 and mT5) benefit from this augmentation, achieving higher BLEU-4 and COMET scores on the augmented data and competitive results against the German sign-language benchmark RWTH-PHOENIX-2014T. The authors validate the augmentation approach using Cohen’s kappa and outline future work on multimodal, non-manual sign features and 3D sign representations. The dataset, released under CC BY-4.0, aims to catalyze research in continuous Bangla Sign Language recognition and translation and lays the groundwork for scalable, accessible BdSL tools for the deaf community.

Abstract

Bangla Sign Language (BdSL) translation is a low-resource NLP task because no large-scale datasets address sentence-level translation. As a result, existing research in this field has been limited to word- and alphabet-level detection. In this work, we introduce Bangla-SGP, a novel parallel dataset consisting of 1,000 human-annotated sentence-gloss pairs, augmented with around 3,000 synthetically generated pairs using syntactic and morphological rules through a rule-based Retrieval-Augmented Generation (RAG) pipeline. The gloss sequences of the spoken Bangla sentences are made up of individual glosses, which are Bangla sign-supported words and serve as an intermediate representation for a continuous sign. The 1,000 core sentences are high-quality Bangla sentences manually annotated into gloss sequences by a professional signer. The augmentation process incorporates rule-based linguistic strategies and prompt engineering techniques that we adopted by critically analyzing our human-annotated sentence-gloss pairs and by working closely with our professional signer. Furthermore, we fine-tune several transformer-based models, such as mBART-50, Google mT5, and GPT-4.1-nano, and evaluate their sentence-to-gloss translation performance using BLEU scores. Based on these evaluation metrics, we compare the models' gloss-translation consistency across our dataset and the RWTH-PHOENIX-2014T benchmark.
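
To make the BLEU-based evaluation mentioned above concrete, the following is a minimal sketch of corpus-level BLEU-4 over tokenized gloss sequences. It is not the authors' evaluation code: the function names `ngrams` and `corpus_bleu` are hypothetical, the sketch assumes one reference per hypothesis and applies no smoothing, and the English glosses are invented placeholders rather than entries from Bangla-SGP.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU with a single reference per hypothesis (assumption).

    No smoothing is applied, so the score is 0.0 whenever any n-gram
    order has no matches at all.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = 0
    hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref, n)
            hyp_counts = ngrams(hyp, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than their references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


# Placeholder gloss sequences (invented for illustration only).
reference = [["I", "SCHOOL", "GO", "TOMORROW", "MORNING"]]
exact = [["I", "SCHOOL", "GO", "TOMORROW", "MORNING"]]
truncated = [["I", "SCHOOL", "GO", "TOMORROW"]]

score_exact = corpus_bleu(reference, exact)
score_truncated = corpus_bleu(reference, truncated)
```

In this sketch the truncated hypothesis keeps perfect n-gram precision but is penalized for being shorter than the reference. Published results are usually computed with an established toolkit, so scores from this sketch should not be expected to match reported numbers exactly.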

Paper Structure

This paper contains 26 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Dataset Creation
  • Figure 2: Data Augmentation
  • Figure 3: Rule Set Collection
  • Figure 4: Example of verb tense transformations used in rule-based morphological augmentation.
  • Figure 5: RAG Based Augmentation
  • ...and 1 more figure