BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques

Muhammad Rafsan Kabir, Md. Mohibur Rahman Nabil, Mohammad Ashrafuzzaman Khan

TL;DR

This work introduces two lightweight sentence transformers for the Bangla language, trained with a novel cross-lingual knowledge distillation approach. The resulting models consistently outperformed existing Bangla sentence transformers.

Abstract

Sentence-level embeddings are essential for various tasks that require understanding natural language. Many studies have explored such embeddings for high-resource languages like English. However, low-resource languages like Bangla (a language spoken by almost 230 million people) are still under-explored. This work introduces two lightweight sentence transformers for the Bangla language, leveraging a novel cross-lingual knowledge distillation approach. This method distills knowledge from a pre-trained, high-performing English sentence transformer. The proposed models are evaluated across multiple downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection. The models trained with the new method consistently outperformed existing Bangla sentence transformers. Moreover, their lightweight architecture and shorter inference time make them well suited to deployment in resource-constrained environments and valuable for practical NLP applications in low-resource languages.

Paper Structure

This paper contains 10 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Performance comparison of our proposed sentence transformer, BanglaEmbed-MSE, on the paraphrase detection task, evaluated using accuracy, mean cosine similarity, and the number of trainable parameters.
  • Figure 2: Sample EN-BN sentence pairs from the machine translation dataset.
  • Figure 3: Proposed cross-lingual knowledge distillation methodology for training the Bangla sentence transformer, leveraging an English-Bangla machine translation dataset. The bidirectional arrow ($\downarrow \uparrow$) indicates that both English and Bangla embeddings are aligned to map into the same embedding space.
  • Figure 4: t-SNE visualizations of four distinct sentence transformers. The BanglaEmbed-MSE model shows superior performance in separating clusters, indicating higher quality sentence embeddings compared to the other models.
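
The distillation setup described in the abstract and in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the objective is a mean squared error that pulls the student's embeddings of both the English sentence and its Bangla translation toward a frozen English teacher's embedding of the English sentence. That assumption is suggested by the model name BanglaEmbed-MSE and by the caption's statement that both embeddings map into the same space, but the exact loss is not given in this excerpt. The function names `mse` and `distillation_loss` are hypothetical, and plain Python lists stand in for model outputs.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two embedding vectors of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher_en: Sequence[Vector],
    student_en: Sequence[Vector],
    student_bn: Sequence[Vector],
) -> float:
    """Assumed objective: for each EN-BN pair, sum the MSE between the
    teacher's English embedding and the student's English embedding, and
    the MSE between the teacher's English embedding and the student's
    Bangla embedding, then average over the batch."""
    n = len(teacher_en)
    if n == 0 or not (len(student_en) == len(student_bn) == n):
        raise ValueError("all batches must be non-empty and the same size")
    total = 0.0
    for t, s_en, s_bn in zip(teacher_en, student_en, student_bn):
        total += mse(s_en, t) + mse(s_bn, t)
    return total / n


if __name__ == "__main__":
    # Toy 2-dimensional embeddings for a batch of two sentence pairs.
    teacher = [[0.0, 0.0], [0.0, 0.0]]
    student_english = [[1.0, 1.0], [1.0, 1.0]]
    student_bangla = [[0.0, 0.0], [0.0, 0.0]]
    print(distillation_loss(teacher, student_english, student_bangla))
```

In a real training loop the teacher would stay frozen and only the student's parameters would be updated to reduce this loss. The sketch computes the loss value only and performs no optimization.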