OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis

Junuk Cha, Jihyeon Kim, Han-Mu Park

TL;DR

This paper proposes two components: a multi-hand-capable fingerspelling recognizer that accepts both single- and multi-hand inputs and detects the signing hand implicitly through a dual-level positional encoding and a signing-hand focus (SF) loss, and a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for out-of-vocabulary (OOV) words.

Abstract

Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to signing-hand ambiguity, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, instead of relying on the CTC loss, we introduce a monotonic alignment (MA) loss that, through cross-attention regularization, constrains the output letter sequence to follow the temporal order of the input pose sequence. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art recognition performance and validate the effectiveness of the proposed recognizer and generator. Code and data are available at: https://github.com/JunukCha/OpenFS.
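
The abstract describes the MA loss only at a high level, as a cross-attention regularizer that keeps output letters in the temporal order of the input poses. The following is a minimal sketch of one plausible penalty of that kind, not the paper's formulation: it assumes each output letter token has a cross-attention row over the input pose tokens, computes the attention-weighted mean frame index per letter, and penalizes any letter whose mean frame falls before that of the preceding letter. The function name `monotonic_alignment_penalty`, the `frame_ids` argument, and the hinge form are all assumptions made for illustration.

```python
def monotonic_alignment_penalty(attn, frame_ids):
    """Hinge penalty on attention centers that move backwards in time.

    attn      : one row per output letter token; each row holds attention
                weights over the input pose tokens and sums to 1.
    frame_ids : frame index of each input pose token (hypothetical input;
                tokens of different hands may share a frame index).
    """
    # Attention-weighted mean frame index for every output letter token.
    centers = [sum(w * f for w, f in zip(row, frame_ids)) for row in attn]
    # A violation occurs when a letter attends, on average, to earlier
    # frames than the letter before it.
    violations = [max(0.0, prev - cur) for prev, cur in zip(centers, centers[1:])]
    # Average over adjacent letter pairs; a single-letter word has no pairs.
    return sum(violations) / max(len(violations), 1)


frame_ids = [0, 1, 2]
in_order = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reversed_order = list(reversed(in_order))

penalty_in_order = monotonic_alignment_penalty(in_order, frame_ids)
penalty_reversed = monotonic_alignment_penalty(reversed_order, frame_ids)
```

In this sketch, attention that advances through the frames incurs no penalty, while attention that runs backwards is penalized in proportion to how far each letter's center moves back.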

Paper Structure

This paper contains 30 sections, 17 equations, 14 figures, and 18 tables.

Figures (14)

  • Figure 1: Motivation for multi-hand-capable fingerspelling recognition. The input consists of the hand pose sequence extracted from a video in which the word "nad" is fingerspelled using the right hand. Existing methods [shi2018american, shi2019fingerspelling, fayyazsanavi2024fingerspelling] rely on explicit signing-hand detection. However, they misidentify the signing hand when the non-signing hand exhibits large motion changes, which subsequently causes recognition failures. In contrast, our multi-hand-capable fingerspelling recognizer implicitly detects the signing hand from the multi-hand pose sequence to infer the target word. As evidence, the cross-attention map presents high attention values on the frames of the right hand when predicting the word "nad".
  • Figure 2: Overview of the multi-hand-capable fingerspelling recognizer. The hand pose sequence is embedded into a feature space and encoded using our proposed dual-level positional encoding, which consists of hand-identity encoding ($\tau$) and temporal positional encoding ($\eta$). The recognizer's decoder then predicts the next letter token based on the pose-aware, semantically rich encoder features. $\psi$ denotes the standard positional encoding [vaswani2017attention], and $W_i$ represents the $i$-th letter of the word. <start> and <end> are special tokens indicating the start and end of the letter token sequence, respectively.
  • Figure 3: Overview of the signing-hand focus (SF) and monotonic alignment (MA) losses. (a) The SF loss $\mathcal{L}_{\text{SF}}$ measures the entropy of the hand-wise attention distribution derived from the cross-attention map between input hand pose tokens and output letter tokens. Minimizing this entropy encourages the recognizer to focus on the single signing hand. (b) The MA loss $\mathcal{L}_{\text{MA}}$ penalizes misalignments that violate the natural temporal order between input hand pose tokens and output letter tokens in fingerspelling. Reducing these violations encourages the model to interpret the hand pose tokens in a temporally coherent manner when predicting each letter token.
  • Figure 4: Overview of the coarse-to-fine frame-wise letter annotation method. (a) We use the cross-attention map between input hand pose tokens and output letter tokens to generate coarse frame-wise letter annotations, where $\phi$ denotes a non-letter annotation. (b) To refine the coarse frame-wise letter annotations, we freeze the pre-trained recognizer and train a frame-wise annotation refiner supervised by the coarse annotations. (c) The trained frame-wise annotation refiner produces refined frame-wise letter annotations. The coarse and refined annotations are compared with the corresponding image frames: each label–frame pair is linked by an arrow, and mismatched cases are highlighted in red.
  • Figure 5: Overview of the frame-wise letter-conditioned generator. $W_i$ is the $i$-th letter of the word, $|W|$ is the word length, $\otimes$ denotes concatenation, and $\psi$ denotes the standard positional encoding [vaswani2017attention]. The generator embeds each letter token and each noised pose vector through their respective embedding layers. The resulting letter and pose embeddings are concatenated frame-wise and, given a diffusion timestep, are denoised by the generator encoder to produce a clean hand-pose sequence.
  • ...and 9 more figures
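
The Figure 3 caption describes the SF loss as the entropy of a hand-wise attention distribution derived from the cross-attention map. The sketch below illustrates that idea under stated assumptions and is not taken from the released code: each cross-attention row is assumed to sum to 1 over the input pose tokens, the attention mass is summed per hand, and the entropy of the resulting distribution is averaged over output letter tokens. The names `signing_hand_focus_loss`, `hand_ids`, `num_hands`, and `eps` are hypothetical.

```python
import math


def signing_hand_focus_loss(attn, hand_ids, num_hands=2, eps=1e-12):
    """Mean entropy of the hand-wise attention distribution.

    attn     : one row per output letter token; each row holds attention
               weights over the input pose tokens and sums to 1.
    hand_ids : hand index (0 .. num_hands - 1) of each input pose token.
    """
    total = 0.0
    for row in attn:
        # Pool token-level attention into one probability mass per hand.
        mass = [0.0] * num_hands
        for weight, hand in zip(row, hand_ids):
            mass[hand] += weight
        # Shannon entropy of the hand-wise distribution; eps avoids log(0).
        total += -sum(p * math.log(p + eps) for p in mass)
    return total / len(attn)


# Six pose tokens: the first three belong to hand 0, the last three to hand 1.
hand_ids = [0, 0, 0, 1, 1, 1]
focused = [[0.5, 0.3, 0.2, 0.0, 0.0, 0.0],
           [0.1, 0.2, 0.7, 0.0, 0.0, 0.0]]
split = [[0.25, 0.25, 0.0, 0.25, 0.25, 0.0],
         [0.0, 0.5, 0.0, 0.0, 0.5, 0.0]]

loss_focused = signing_hand_focus_loss(focused, hand_ids)
loss_split = signing_hand_focus_loss(split, hand_ids)
```

With two hands, the value is close to 0 when all attention falls on one hand and close to $\log 2$ when attention is split evenly, so minimizing it pushes the attention toward a single hand, as the caption states.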