Training Strategies for Isolated Sign Language Recognition

Karina Kvanchiani, Roman Kraynov, Elizaveta Petrova, Petr Surovcev, Aleksandr Nagaev, Alexander Kapitanov

TL;DR

Isolated Sign Language Recognition (ISLR) suffers from limited high-quality data and variable signing speeds. The authors propose a model-agnostic training pipeline that combines video-level speed augmentations, image-level quality augmentations, and domain-aware losses to improve RGB-based ISLR without changing model architectures. Two components guide temporal localization and boundary awareness: a sign-boundary regression head trained with a Huber loss, and an IoU-balanced cross-entropy loss, where $IoU = \frac{\min(w_{end}, s_{end}) - \max(w_{start}, s_{start})}{w_{end} - w_{start} + 1}$, with $w$ denoting the sliding window and $s$ the sign interval. The approach achieves state-of-the-art results on WLASL and Slovo and generalizes across architectures, aided by the newly released SlovoExt extension, which contains 51,000 samples across 1,001 classes. Code and pretrained models are released to support broader research. The practical impact is a robust, scalable training paradigm that improves ISLR performance in real-world conditions and enhances accessibility for Deaf communities.
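The IoU formula in the TL;DR can be transcribed into a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function name `window_iou`, the use of integer frame indices, and the clamping of negative values to zero for non-overlapping windows are all assumptions added for illustration. How the score is then folded into the cross-entropy loss is not described here, so that step is left out.

```python
def window_iou(w_start, w_end, s_start, s_end):
    """IoU between a window [w_start, w_end] and a sign [s_start, s_end].

    Transcribes the TL;DR formula term by term. Clamping to 0.0 when the
    window and the sign do not overlap is an assumption of this sketch.
    """
    overlap = min(w_end, s_end) - max(w_start, s_start)
    window_len = w_end - w_start + 1
    return max(0.0, overlap / window_len)


# Hypothetical example: a sign spanning frames 2..5, scored against
# sliding windows of 3 frames (window size chosen only for illustration).
scores = [window_iou(w, w + 2, 2, 5) for w in range(6)]
```

Windows that lie mostly inside the sign receive higher scores than windows that only touch its boundary, which is the signal an IoU-balanced classification loss relies on.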

Abstract

Accurate recognition and interpretation of sign language are crucial for enhancing communication accessibility for deaf and hard of hearing individuals. However, current approaches to Isolated Sign Language Recognition (ISLR) often face challenges such as low data quality and variability in gesturing speed. This paper introduces a comprehensive model training pipeline for ISLR designed to accommodate the distinctive characteristics and constraints of the Sign Language (SL) domain. The constructed pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds. Including an additional regression head combined with IoU-balanced classification loss enhances the model's awareness of the gesture and simplifies capturing temporal information. Extensive experiments demonstrate that the developed training pipeline easily adapts to different datasets and architectures. Additionally, the ablation study shows that each proposed component expands the potential to consider ISLR task specifics. The presented strategies enhance recognition performance across various ISLR benchmarks and achieve state-of-the-art results on the WLASL and Slovo datasets.

Paper Structure

This paper contains 21 sections, 1 equation, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: Obtaining IoU scores. (a) Consider the initial video, where the sign spans frames 2 to 5. (b) Randomly shift the sign's boundaries by one frame to enhance model robustness (see the video augmentation subsection for details). (c) Collect frames by sliding a window of size 3 across the video. (d) Calculate IoU scores by dividing the number of sign frames in the window by the window size, and adjust the classification scores accordingly. The window size of 3 was selected purely for illustration and was not used in our experiments.
  • Figure 2: Overall training pipeline. Video-level and image-level augmentations are applied, and the neural network is further trained with augmented data and sign boundary annotations.
  • Figure 3: Applying video augmentations to untrimmed videos. Identical frames are highlighted in the same color, and sign boundaries are outlined with a dashed line. Grey boundaries indicate duplicates of the last frame. (a) Speed up the video 2 times by removing every second frame; (b) slow down the video 2 times by duplicating every second frame; (c) random frame drop keeps 80% of the total video length; (d) random frame duplication increases the total video length by 20%; (e) random boundary shift, e.g., a shift of (1, -1) adds one frame on the left and removes one on the right. These values were chosen for ease of demonstration.
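
The video-level augmentations listed in the Figure 3 caption can be sketched on a plain list of frames. This is an illustrative sketch under stated assumptions, not the released code: all function names and parameters are hypothetical, the slow-down step is read as repeating each frame so that every second output frame is a copy, and remapping the sign boundaries after a speed change is omitted.

```python
import random


def speed_up(frames, factor=2):
    """Keep every `factor`-th frame (factor=2 removes every second frame)."""
    return frames[::factor]


def slow_down(frames, factor=2):
    """Repeat each frame `factor` times (assumed reading of the caption)."""
    return [f for f in frames for _ in range(factor)]


def random_drop(frames, keep_ratio=0.8, rng=random):
    """Keep a random subset of frames, preserving temporal order."""
    n_keep = max(1, round(len(frames) * keep_ratio))
    kept = sorted(rng.sample(range(len(frames)), n_keep))
    return [frames[i] for i in kept]


def random_duplicate(frames, extra_ratio=0.2, rng=random):
    """Duplicate randomly chosen frames in place to lengthen the video."""
    n_extra = round(len(frames) * extra_ratio)
    dup = set(rng.sample(range(len(frames)), n_extra))
    out = []
    for i, f in enumerate(frames):
        out.append(f)
        if i in dup:
            out.append(f)  # repeat the chosen frame once
    return out


def shift_boundaries(start, end, shift, num_frames):
    """Shift sign boundaries; (1, -1) adds a frame on the left and
    removes one on the right. Clamping to the video is an assumption."""
    left, right = shift
    new_start = max(0, start - left)
    new_end = min(num_frames - 1, end + right)
    return new_start, new_end


# Hypothetical usage on a 10-frame video represented by frame indices.
video = list(range(10))
faster = speed_up(video)
slower = slow_down(video)
dropped = random_drop(video, rng=random.Random(0))
longer = random_duplicate(video, rng=random.Random(0))
shifted = shift_boundaries(2, 5, (1, -1), len(video))
```

Passing a seeded `random.Random` instance keeps the random variants reproducible, which helps when checking that the augmented clip still contains the annotated sign.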