Deep Understanding of Sign Language for Sign to Subtitle Alignment

Youngjoon Jang, Jeongsoo Choi, Junseok Ahn, Joon Son Chung

TL;DR

This work addresses the alignment of asynchronous subtitles to sign language video when labelled data is limited. It combines three components: grammar-informed subtitle pre-processing for British Sign Language (BSL), a selective alignment loss that combines $\mathcal{L}_{align}$ with $\mathcal{L}_{neg}$ and $\mathcal{L}_{rel}$, and a self-training loop that exploits model-generated pseudo-labels. A multimodal Transformer takes the pre-processed subtitles, the sign language video, and a timing prior derived from the audio-aligned subtitles, and produces frame-level alignment scores. On the BOBSL (BBC-Oxford British Sign Language) dataset, the method reports state-of-the-art frame-level accuracy and F1 across IoU thresholds; ablations confirm the contribution of each component, and self-training provides further gains. The results indicate potential for scalable sign language translation and accessibility applications with reduced dependence on manual labelling.

Abstract

The objective of this work is to align asynchronous subtitles in sign language videos with limited labelled data. To achieve this goal, we propose a novel framework with the following contributions: (1) we leverage fundamental grammatical rules of British Sign Language (BSL) to pre-process the input subtitles, (2) we design a selective alignment loss to optimise the model for predicting the temporal location of signs only when the queried sign actually occurs in a scene, and (3) we conduct self-training with refined pseudo-labels which are more accurate than the heuristic audio-aligned labels. From this, our model not only better understands the correlation between the text and the signs, but also holds potential for application in the translation of sign languages, particularly in scenarios where manual labelling of large-scale sign data is impractical or challenging. Extensive experimental results demonstrate that our approach achieves state-of-the-art results, surpassing previous baselines by substantial margins in terms of both frame-level accuracy and F1-score. This highlights the effectiveness and practicality of our framework in advancing the field of sign language video alignment and translation.
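The two reported metrics, frame-level accuracy and F1-score at an IoU threshold, can be illustrated with a minimal sketch. It assumes that each subtitle has exactly one predicted segment and one ground-truth segment, and that a prediction counts as a hit when its temporal IoU with the ground truth reaches the threshold. The function names and this one-to-one pairing are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(pred: List[Segment], gt: List[Segment], thr: float) -> float:
    """F1 where pred[i] is scored against gt[i] (assumed one-to-one pairing)."""
    hits = sum(temporal_iou(p, g) >= thr for p, g in zip(pred, gt))
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gt) if gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def frame_accuracy(pred_labels: Sequence[int], gt_labels: Sequence[int]) -> float:
    """Fraction of frames whose predicted subtitle index matches the ground truth."""
    if len(pred_labels) != len(gt_labels):
        raise ValueError("label sequences must have the same length")
    if not gt_labels:
        return 0.0
    correct = sum(p == g for p, g in zip(pred_labels, gt_labels))
    return correct / len(gt_labels)


if __name__ == "__main__":
    pred = [(0.0, 2.0), (5.0, 6.0)]
    gt = [(1.0, 3.0), (5.0, 6.0)]
    for thr in (0.1, 0.25, 0.5):
        print(f"F1@{thr}: {f1_at_iou(pred, gt, thr):.2f}")
    print("frame accuracy:", frame_accuracy([0, 0, 1, 1], [0, 1, 1, 1]))
```

Raising the IoU threshold makes the F1 criterion stricter, while frame-level accuracy rewards partial overlap in proportion to the number of correctly labelled frames.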

Paper Structure

This paper contains 14 sections, 4 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: This work aims to align subtitles with continuous signing in sign language interpreted TV broadcast data by leveraging the grammatical systems of British Sign Language. Using two different modalities—video and audio-aligned subtitles—our framework encodes visual features and pre-processes the input query text based on the linguistics of BSL. The output consists of time segments that indicate the points in time when the sign language corresponding to the text is uttered.
  • Figure 2: An illustration of our framework. We input to our model: (1) a pseudo-gloss sequence pre-processed by a subtitle pre-processing mechanism, (2) a positive video aligned with the input query, and (3) a negative video not aligned with the input query. Both videos are encoded with shifted temporal boundaries of the audio-aligned subtitle, denoted as $S_{prior}$ and $S^{neg}_{prior}$. Using a Transformer decoder, the model predicts frame-level alignment between text and video. Note that the negative video is only provided during training.
  • Figure 3: Qualitative results. This figure presents frame-level model predictions, showing timelines for the audio-aligned prior, the SAT model, our model trained without the selective alignment loss, our full model, and the ground truth. Our model, which incorporates rich sign language linguistics, aligns more closely with the ground truth than the other baselines. In addition, the frame-level alignment probability visualisation at the bottom of this figure illustrates our model's clear timeline alignment, demonstrating that it effectively correlates text with video content.
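The captions describe frame-level alignment probabilities as the model output and time segments as the final result. One simple way to bridge the two is sketched below: threshold the probabilities and keep the longest contiguous run of frames above the threshold. The function name, the 0.5 threshold, the 25 fps frame rate, and the longest-run rule are all assumptions chosen for illustration; the source does not state how the authors post-process their predictions.

```python
from typing import Optional, Sequence, Tuple


def probs_to_segment(
    probs: Sequence[float],
    fps: float = 25.0,       # assumed frame rate
    threshold: float = 0.5,  # assumed decision threshold
) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds of the longest run of frames with
    probability >= threshold, or None if no frame reaches the threshold."""
    best = None   # (start_frame, end_frame_exclusive) of the longest run so far
    start = None  # start frame of the run currently being scanned
    # A -inf sentinel guarantees that a run reaching the last frame is closed.
    for i, p in enumerate(list(probs) + [float("-inf")]):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if best is None or (i - start) > (best[1] - best[0]):
                best = (start, i)
            start = None
    if best is None:
        return None
    return best[0] / fps, best[1] / fps


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.9, 0.95, 0.1]
    print(probs_to_segment(scores))      # longest run covers frames 4 to 6
    print(probs_to_segment([0.1, 0.2]))  # no frame above threshold
```

Returning `None` when no frame passes the threshold mirrors the idea that a queried subtitle may not be signed in the given window at all, which is the case the selective alignment loss is described as handling during training.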