Bukva: Russian Sign Language Alphabet
Karina Kvanchiani, Petr Surovtsev, Alexander Nagaev, Elizaveta Petrova, Alexander Kapitanov
TL;DR
The paper tackles the data scarcity problem in Russian Sign Language (RSL) dactyl recognition by introducing Bukva, the first full-fledged open-source video dataset for the RSL alphabet, with 3,757 samples across 33 signs collected from 155 signers. It presents a rigorous data collection and curation pipeline, including an RSL knowledge exam, multi-stage filtration and validation, and time-interval annotation, and demonstrates that a Temporal Shift Module (TSM)-based approach enables accurate, real-time recognition on CPU with lightweight architectures such as MobileNetV2. Key contributions include the public release of Bukva, demo code, and pre-trained models, alongside a benchmark reporting an overall top-1 accuracy of $0.836$ (83.6%) under constrained data conditions and an accessible teaching and demo platform. The work supports sign-language education and potential real-world applications (e.g., education tools, signage at stations) while emphasizing ethical data collection and expert involvement, and it lays a foundation for future continuous dactyl datasets and broader fingerspelling research.
Abstract
This paper investigates the recognition of the Russian fingerspelling alphabet, also known as the Russian Sign Language (RSL) dactyl. Dactyl is a component of sign languages in which distinct hand movements represent individual letters of a written language. It is used to spell words that have no dedicated sign, such as proper nouns or technical terms. An alphabet learning simulator is an essential application of isolated dactyl recognition. Isolated dactyl recognition suffers from a notable data shortage: existing Russian dactyl datasets lack subject heterogeneity, contain insufficient samples, or cover only static signs. We provide Bukva, the first full-fledged open-source video dataset for RSL dactyl recognition. It contains 3,757 videos with more than 101 samples for each RSL alphabet sign, including dynamic ones. We used crowdsourcing platforms to increase subject heterogeneity, resulting in the participation of 155 deaf and hard-of-hearing experts in the dataset creation. We use a TSM (Temporal Shift Module) block to handle static and dynamic signs effectively, achieving 83.6% top-1 accuracy with real-time inference on CPU only. The dataset, demo code, and pre-trained models are publicly available.
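To make the temporal shift idea behind a TSM block concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `temporal_shift`, the `fold_div` parameter, and the list-of-lists feature layout are assumptions chosen for illustration, and the abstract does not state which shift configuration the paper uses. The sketch shows the commonly described bidirectional variant, where one slice of channels takes its values from the next frame, another slice from the previous frame, and the remaining channels stay in place.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis (illustrative sketch).

    frames: list of T per-frame feature vectors, each a list of C numbers.
    fold_div: hypothetical parameter; C // fold_div channels are shifted
              in each direction, the rest are left untouched.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    # Start from zeros so positions with no source frame stay zero-padded.
    shifted = [[0.0] * num_channels for _ in range(num_frames)]

    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # first slice: take the value from the next frame
            elif c < 2 * fold:
                src = t - 1      # second slice: take the value from the previous frame
            else:
                src = t          # remaining channels: no shift
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Three frames with four channels each; with fold_div=4 one channel moves each way.
example = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]
mixed = temporal_shift(example, fold_div=4)
```

Because the shift only moves existing values between neighboring frames, it adds no parameters or multiplications, which is consistent with the real-time CPU inference reported above. A static sign yields nearly identical frames, so the shifted channels change little, while a dynamic sign exposes motion through the mixed channels.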
