MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition

Mingyu Zhao, Zhanfu Yang, Yang Zhou, Zhaoyang Xia, Can Jin, Xiaoxiao He, Dimitris N. Metaxas

TL;DR

The paper tackles continuous sign language recognition (CSLR) by introducing a multimodal boundary-detection framework that fuses 3D handshape cues with a spatio-temporal skeleton stream through cross-attention, producing refined boundaries for downstream sign recognition. It combines an ST-GCN backbone with velocity/acceleration features and a handshape classifier, and optimizes a frame-wise loss plus a boundary-aware term to improve segmentation accuracy. Boundary-derived segments are evaluated with a state-of-the-art isolated sign classifier, demonstrating substantial gains in segmentation (mF1B) and encouraging recognition performance within a tolerance-based evaluation. The work advances robust CSLR by leveraging multimodal cues and structured skeletal representations, while outlining future directions toward end-to-end CSLR and linguistically aware sign-type differentiation.

Abstract

This paper employs a multimodal approach for continuous sign recognition by first using machine learning to detect the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then recognizing the segmented signs. For improved robustness, we use 3D skeletal features extracted from sign language videos to take into account the convergence of sign properties and their dynamics that tend to cluster at sign boundaries. Another focus of this paper is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for detection of 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation-form isolated signs and signs pre-segmented (based on manual annotations) from continuous signing, as such signs often differ somewhat in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.
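
As a rough illustration of the first step described in the abstract (finding the start and end frames of signs), the sketch below converts frame-wise predictions into segments. It assumes a binary labeling (1 = sign, 0 = non-sign), which is a simplification chosen for this example and not necessarily the paper's label scheme; the function name `labels_to_segments` is hypothetical.

```python
from typing import List, Tuple


def labels_to_segments(frame_labels: List[int]) -> List[Tuple[int, int]]:
    """Turn frame-wise labels (1 = sign, 0 = non-sign; an assumed scheme)
    into (start_frame, end_frame) pairs, with the end frame inclusive."""
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current run of sign frames began
    for i, label in enumerate(frame_labels):
        if label == 1 and start is None:
            start = i  # a new sign segment opens here
        elif label != 1 and start is not None:
            segments.append((start, i - 1))  # the segment closed on the previous frame
            start = None
    if start is not None:
        # The sequence ended while a sign was still in progress.
        segments.append((start, len(frame_labels) - 1))
    return segments


example = [0, 1, 1, 1, 0, 0, 1, 1, 0, 1]
print(labels_to_segments(example))  # [(1, 3), (6, 7), (9, 9)]
```

Under this binary assumption, two adjacent signs with no non-sign frame between them collapse into a single segment, so a practical system needs some additional boundary signal to separate them.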

Paper Structure

This paper contains 28 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of our proposed multimodal boundary detection. The left branch shows the handshape classification module, which is pretrained on curated handshape datasets using a 3-layer GCN and later used to extract frame-wise handshape features. We pretrain this branch in stage 2. The right branch is the segmentation network based on ST-GCN and CNN modules, which processes all skeletal joints augmented with their velocity and acceleration. We pretrain this branch without the Cross-Modal layer in stage 1. In the middle, a cross-modal attention module with a gating mechanism fuses the handshape and segmentation features to enhance temporal boundary prediction. We fine-tune the right branch with the cross-modal attention module in stage 3. The final segmentation output is further used to extract sign clips and then used for sign recognition.
  • Figure 2: Overview of the experimental procedure. We begin by extracting 2D skeletal keypoints from sign language videos using a pose estimation tool. We use velocity and acceleration information to augment the joint feature for each frame. These are then separated into two branches: hand joints are fed into a pretrained handshape classification model, while all joints are used as input to a segmentation model. The outputs of both branches are fused to produce improved segmentation boundaries. For sign recognition, the segmented sign clips are further passed into a sign recognition model trained on both citation-form signs and signs segmented from continuous signing to produce final sign predictions.
  • Figure 3: Qualitative results on the ASLLRP-S dataset. Each pair of rows corresponds to one ASL sentence video: the top row in each pair shows the ground-truth segmentation (GT), and the bottom row shows the model prediction (Pred). Blue bars indicate sign segments, and white indicates background or non-sign regions. The predicted segments closely align with the ground truth, especially for shorter sentences (as shown in the first and second pairs). In longer sentences, some segments may be missed (as shown in the third pair) or two adjacent segments may occasionally be merged into one (as shown in the fourth pair), but the model delivers reasonably accurate segmentation overall. This highlights the model’s ability to detect temporal boundaries across sentences with different numbers of signs.
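
The velocity and acceleration augmentation mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming first-order backward differences with zero padding at the first frame and simple concatenation of position, velocity, and acceleration; the paper's exact formulation is not given here, and the names `temporal_diff` and `augment_with_motion` are hypothetical.

```python
from typing import List

Frame = List[float]  # flattened joint coordinates for one frame, e.g. [x1, y1, x2, y2, ...]


def temporal_diff(seq: List[Frame]) -> List[Frame]:
    """Backward finite difference along time.
    The first frame gets zeros so the sequence length is preserved (an assumed convention)."""
    if not seq:
        return []
    out = [[0.0] * len(seq[0])]
    for prev, cur in zip(seq, seq[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def augment_with_motion(positions: List[Frame]) -> List[Frame]:
    """Concatenate position, velocity, and acceleration for every frame."""
    velocity = temporal_diff(positions)      # first difference of positions
    acceleration = temporal_diff(velocity)   # first difference of velocities
    return [p + v + a for p, v, a in zip(positions, velocity, acceleration)]


# One joint with (x, y) coordinates over four frames, speeding up along x.
positions = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [6.0, 0.0]]
features = augment_with_motion(positions)
print(features[3])  # [6.0, 0.0, 3.0, 0.0, 1.0, 0.0]
```

Each frame's feature vector is three times the length of the raw coordinates. Because of the zero padding, the velocity at the first frame and the acceleration at the first two frames are artifacts of the padding rather than measured motion.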