BdSLW401: Transformer-Based Word-Level Bangla Sign Language Recognition Using Relative Quantization Encoding (RQE)

Husne Ara Rubaiyeat, Njayou Youssouf, Md Kamrul Hasan, Hasan Mahmud

TL;DR

This work addresses Bangla Sign Language recognition using a new large-scale, multi-view dataset (BdSLW401) with 401 signs and 102,176 video samples across 18 signers. It introduces Relative Quantization Encoding (RQE), a structured landmark representation anchored to physiological references that improves transformer attention, with an extension, RQE-SF, for shoulder stabilization. Across BdSLW401 and benchmark datasets, RQE substantially reduces Word Error Rate (WER) on small- to medium-scale datasets (e.g., a reduction of up to 44% on WLASL100, and 25.37% WER on SignBD-90 with RQE-SF), while gains diminish on very large corpora, indicating a need for adaptive encoding and multi-view fusion. The approach improves interpretability by aligning attention to salient articulators and frames, offering a more reliable and explainable pathway toward real-world SLR for low-resource languages. Future directions include adaptive quantization and depth-aware, multi-view methods.

Abstract

Sign language recognition (SLR) for low-resource languages like Bangla suffers from signer variability, viewpoint variations, and limited annotated datasets. In this paper, we present BdSLW401, a large-scale, multi-view, word-level Bangla Sign Language (BdSL) dataset with 401 signs and 102,176 video samples from 18 signers in front and lateral views. To improve transformer-based SLR, we introduce Relative Quantization Encoding (RQE), a structured embedding approach that anchors landmarks to physiological reference points and quantizes motion trajectories. RQE improves attention allocation by decreasing spatial variability, resulting in a 44.3% word error rate (WER) reduction on WLASL100, 21.0% on SignBD-200, and significant gains on BdSLW60 and SignBD-90. However, fixed quantization becomes insufficient on large-scale datasets (e.g., WLASL2000), indicating the need for adaptive encoding strategies. Further, RQE-SF, an extended variant that stabilizes shoulder landmarks, improves pose consistency at the cost of small trade-offs in lateral-view recognition. The attention graphs show that RQE improves model interpretability by focusing on the major articulatory features (fingers, wrists) and the more distinctive frames instead of global pose changes. By introducing BdSLW401 and demonstrating the effectiveness of RQE-enhanced structured embeddings, this work advances transformer-based SLR for low-resource languages and sets a benchmark for future research in this area.
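
The abstract describes RQE only at a high level: landmarks are expressed relative to a physiological reference point and then quantized. The following is a minimal sketch of that general idea, not the authors' formulation. The function name relative_quantize, the single anchor point, the scale normalizer (for instance, a shoulder width), and the number of bins per unit are all assumptions introduced here for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def relative_quantize(
    landmarks: Sequence[Point],
    anchor: Point,
    scale: float,
    levels: int = 10,
) -> List[Tuple[int, int]]:
    """Express 2D landmarks relative to an anchor and snap them to integer bins.

    Hypothetical helper: `anchor`, `scale`, and `levels` are illustrative
    assumptions, not parameters taken from the paper.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    encoded = []
    for x, y in landmarks:
        # Offset from the reference point, measured in units of `scale`.
        dx = (x - anchor[0]) / scale
        dy = (y - anchor[1]) / scale
        # Uniform quantization: `levels` bins per unit of normalized distance.
        encoded.append((round(dx * levels), round(dy * levels)))
    return encoded


# Two hypothetical frames showing the same hand shape, but with the signer
# at a different position and apparent size in the image.
frame_a = relative_quantize([(0.625, 0.375), (0.75, 0.5)], anchor=(0.5, 0.5), scale=0.25)
frame_b = relative_quantize([(0.5, 0.0), (0.75, 0.25)], anchor=(0.25, 0.25), scale=0.5)

# Both frames map to the same discrete codes after anchoring and scaling.
assert frame_a == frame_b
```

In this sketch, translation and scale differences between signers vanish before the model sees the input, which is one plausible reading of how an anchored, quantized encoding could reduce spatial variability. A fixed number of bins also limits how finely nearby positions can be distinguished.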

Paper Structure

This paper contains 44 sections, 2 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: SLRT-based Sign Language recognition with RQE.
  • Figure 2: RQE for a referenced frame. From left to right: holistic landmarks of the referenced frame, RQE level for front view, and RQE level for lateral view.
  • Figure 3: Attention graphs for both RQE and Raw.