Training Strategies for Isolated Sign Language Recognition
Karina Kvanchiani, Roman Kraynov, Elizaveta Petrova, Petr Surovcev, Aleksandr Nagaev, Alexander Kapitanov
TL;DR
Isolated Sign Language Recognition (ISLR) suffers from limited high-quality data and variable signing speeds. The authors propose a model-agnostic training pipeline that combines video-level speed augmentations, image-level quality augmentations, and domain-aware losses to improve RGB-based ISLR without changing the model architecture. A sign-boundary regression head trained with a Huber loss and an IoU-balanced cross-entropy loss, where $IoU = \frac{\min(w_{end}, s_{end}) - \max(w_{start}, s_{start})}{w_{end} - w_{start} + 1}$, guide temporal localization and boundary awareness. The approach achieves state-of-the-art results on WLASL and Slovo and generalizes across architectures. The results are aided by the newly released SlovoExt extension, which contains 51,000 samples across 1,001 classes; code and pretrained models are also released to support further research. The practical result is a training recipe that improves ISLR performance under real-world conditions, which in turn supports accessibility for Deaf communities.
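To make the two quantities in this summary concrete, the following is a minimal sketch of the IoU expression and of a standard Huber loss. It rests on assumptions that the summary does not state: $w_{start}, w_{end}$ are read as the first and last frame of the sampled clip window, $s_{start}, s_{end}$ as the annotated sign boundaries, and negative overlaps are clamped to zero. The function names are hypothetical, and how the IoU value enters the cross-entropy term is not specified here, so that step is left out.

```python
def window_sign_iou(w_start, w_end, s_start, s_end):
    """Overlap between a clip window and a sign, normalized by window length.

    Mirrors the formula in the text. Clamping the overlap at zero for
    disjoint intervals is an assumption, not something the text states.
    """
    if w_end < w_start:
        raise ValueError("window end must not precede window start")
    overlap = min(w_end, s_end) - max(w_start, s_start)
    overlap = max(overlap, 0)  # assumption: no negative overlap
    return overlap / (w_end - w_start + 1)


def huber(residual, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear for large residuals.

    The threshold delta=1.0 is an illustrative default, not a value
    taken from the paper.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


if __name__ == "__main__":
    # A 32-frame window (frames 0..31) containing a sign on frames 8..23.
    print(window_sign_iou(0, 31, 8, 23))
    # Boundary regression error of 3 frames with the default threshold.
    print(huber(3.0))
```

In a training loop, the Huber term would typically be applied to the predicted versus annotated sign boundaries, while the IoU value would modulate the classification loss for each sampled window.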
Abstract
Accurate recognition and interpretation of sign language are crucial for enhancing communication accessibility for deaf and hard-of-hearing individuals. However, current approaches to Isolated Sign Language Recognition (ISLR) often face challenges such as low data quality and variability in gesturing speed. This paper introduces a comprehensive model training pipeline for ISLR designed to accommodate the distinctive characteristics and constraints of the Sign Language (SL) domain. The pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds. Adding a regression head combined with an IoU-balanced classification loss enhances the model's awareness of the gesture and simplifies capturing temporal information. Extensive experiments demonstrate that the training pipeline adapts easily to different datasets and architectures. Additionally, the ablation study shows that each proposed component helps the model account for the specifics of the ISLR task. The presented strategies enhance recognition performance across various ISLR benchmarks and achieve state-of-the-art results on the WLASL and Slovo datasets.
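One simple way to picture a video-level speed augmentation is temporal resampling of frame indices. The sketch below is an illustration only and is not taken from the paper: the function name `resample_indices`, the nearest-lower-frame sampling rule, and the example speed factors are all assumptions.

```python
def resample_indices(num_frames, speed):
    """Return frame indices that play a clip at `speed` times its original rate.

    speed > 1 drops frames (faster signing); speed < 1 repeats frames
    (slower signing). Nearest-lower-frame sampling is an assumption made
    for simplicity; interpolation would be another option.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if speed <= 0:
        raise ValueError("speed must be positive")
    out_len = max(1, round(num_frames / speed))
    # Clamp to the last valid frame in case of rounding at the clip end.
    return [min(num_frames - 1, int(i * speed)) for i in range(out_len)]


if __name__ == "__main__":
    print(resample_indices(8, 2.0))  # every second frame
    print(resample_indices(4, 0.5))  # each frame shown twice
```

The returned indices can then be used to gather frames from a decoded clip before any image-level augmentations are applied.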
