
Recognising BSL Fingerspelling in Continuous Signing Sequences

Alyssa Chan, Taein Kwon, Andrew Zisserman

Abstract

Fingerspelling is a critical component of British Sign Language (BSL), used to spell proper names, technical terms, and words that lack established lexical signs. Fingerspelling recognition is challenging because signing is rapid and native signers commonly omit letters, while existing BSL fingerspelling datasets are either small in scale or inaccurate in their temporal and letter-level annotations. In this work, we introduce a new large-scale BSL fingerspelling dataset, FS23K, constructed using an iterative annotation framework. In addition, we propose a fingerspelling recognition model that explicitly accounts for bi-manual interactions and mouthing cues. With the refined annotations, our approach halves the character error rate (CER) compared to the prior state of the art on fingerspelling recognition. These findings demonstrate the effectiveness of our method and highlight its potential to support future research in sign language understanding and scalable, automated annotation pipelines. The project page can be found at https://taeinkwon.com/projects/fs23k/.
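
For reference, CER is conventionally computed as the character-level edit distance between the predicted and reference letter sequences, divided by the reference length. The sketch below illustrates that standard definition only. The function names are hypothetical, and the paper's own evaluation code may differ (for example, in how it normalises text).

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference chars into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hypothesis `hyp` against reference `ref`."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a signer who omits one letter of a six-letter word contributes a single deletion, so `cer("oxford", "oxfrd")` is 1/6.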

Paper Structure (41 sections, 3 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: BSL fingerspelling recognition. Video frames from continuous signing in which a fingerspelling temporal interval is detected, and hand and lip features are used to correctly recognise the signed letters.
  • Figure 2: The BSL alphabet. Unlike many other sign languages, British Sign Language (BSL) employs bi-manual fingerspelling, which poses additional challenges for recognition due to frequent occlusions between the two hands. Note, these examples are for a left-handed signer.
  • Figure 3: Letter 'p' being signed by three different signers. In (a) the right hand is outstretched, differing from the template fingerspelling. Additionally, in (c) the signer is slightly turned to the left, making the position of the left hand more ambiguous.
  • Figure 4: Fingerspelling recognition network architecture. The model leverages two complementary feature modalities: lip features extracted using AUTO-AVSR (Ma et al., 2023) and hand features obtained from HAMER (Pavlakos et al., 2023). Each modality is first passed through its own linear projection to align feature dimensions, followed by a separate Transformer encoder. The encoded features are then concatenated and further processed by a Transformer encoder. Finally, a two-layer MLP predicts per-frame letter labels, which are also used as inputs to the CTC decoder. The numbers beside the arrows give the feature dimensions. The 384-dimensional hand feature vector covers both hands.
  • Figure 5: Histogram of letter distribution in FS23K. The letters a (16,577) and e (13,754) occur most frequently, whereas q (143) and x (322) appear least often. This imbalance reflects the natural distribution of letters in in-the-wild BBC broadcast data.
  • ...and 10 more figures
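
The last step described in the Figure 4 caption, turning per-frame letter labels into a letter sequence through a CTC decoder, can be illustrated with greedy CTC decoding: merge consecutive repeated labels, then drop the blank symbol. This is a minimal sketch of the generic technique, not the authors' decoder. The blank symbol `"_"` and the function name are assumptions made for illustration, and the paper may use a different decoding strategy such as beam search.

```python
from itertools import groupby
from typing import Iterable

BLANK = "_"  # assumed blank symbol; not specified in the source


def ctc_greedy_decode(frame_labels: Iterable[str]) -> str:
    """Collapse per-frame labels into a letter sequence (greedy CTC)."""
    # Step 1: merge runs of identical consecutive labels into one label.
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks, which separate genuine repeated letters.
    return "".join(label for label in collapsed if label != BLANK)
```

For example, the frame sequence `__aa_nn_n__a` decodes to `anna`: the blank between the two runs of `n` is what preserves the double letter.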