Towards Online Continuous Sign Language Recognition and Translation

Ronglai Zuo, Fangyun Wei, Brian Mak

TL;DR

This work takes the first step towards online CSLR. The online recognition model can be extended to support online translation by integrating a gloss-to-text network, and it can also enhance the performance of any offline model.

Abstract

Research on continuous sign language recognition (CSLR) is essential to bridge the communication gap between deaf and hearing individuals. Numerous previous studies have trained their models using the connectionist temporal classification (CTC) loss. During inference, these CTC-based models generally require the entire sign video as input to make predictions, a process known as offline recognition, which suffers from high latency and substantial memory usage. In this work, we take the first step towards online CSLR. Our approach consists of three phases: 1) developing a sign dictionary; 2) training an isolated sign language recognition model on the dictionary; and 3) employing a sliding window approach on the input sign sequence, feeding each sign clip to the optimized model for online recognition. Additionally, our online recognition model can be extended to support online translation by integrating a gloss-to-text network and can enhance the performance of any offline model. With these extensions, our online approach achieves new state-of-the-art performance on three popular benchmarks across various task settings. Code and models are available at https://github.com/FangyunWei/SLRT.
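The third phase described above (sliding a window over the input and classifying each clip) can be illustrated with a minimal sketch. This is not the authors' implementation: the window size, stride, the background label `<bg>`, the `classify_clip` callable, and the simple post-processing rule (drop background predictions and collapse consecutive repeats) are all assumptions made for illustration.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def sliding_windows(num_frames: int, window: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame indices (end exclusive) for each clip."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    start = 0
    while start < num_frames:
        yield start, min(start + window, num_frames)
        if start + window >= num_frames:
            break  # the last clip already reaches the end of the video
        start += stride


def online_recognize(
    frames: Sequence,
    classify_clip: Callable[[Sequence], str],  # hypothetical isolated-sign classifier
    window: int = 16,   # assumed value, not taken from the paper
    stride: int = 8,    # assumed value, not taken from the paper
    background: str = "<bg>",  # assumed label for "no sign in this clip"
) -> Iterator[str]:
    """Emit a gloss as soon as a clip yields a new, non-background prediction."""
    previous = background
    for start, end in sliding_windows(len(frames), window, stride):
        prediction = classify_clip(frames[start:end])
        # Assumed post-processing: skip background and consecutive duplicates.
        if prediction != background and prediction != previous:
            yield prediction
        previous = prediction


def toy_classifier(clip: Sequence[int]) -> str:
    """Stand-in for a trained model: decides from the first frame index only."""
    if clip[0] < 16:
        return "A"
    if clip[0] < 24:
        return "<bg>"
    return "B"


if __name__ == "__main__":
    video = list(range(40))  # 40 dummy frames
    print(list(online_recognize(video, toy_classifier)))
```

Because each prediction depends only on the current clip, memory use is bounded by the window size rather than the video length, which is the property that separates this scheme from offline recognition.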

Paper Structure

This paper contains 26 sections, 8 equations, 6 figures, and 12 tables.

Figures (6)

  • Figure 1: Illustration of (a) the offline recognition scheme and (b) the proposed online framework.
  • Figure 2: Overview of our methodology.
  • Figure 3: Appending a gloss-to-text network with the wait-$k$ policy to our online CSLR model enables online SLT. Circles and arrows in different colors indicate translation outputs at different time steps.
  • Figure 4: Boosting an offline model with our online model. A lightweight adapter fuses the features of two well-trained CSLR models, one offline and one online. The parameters of both CSLR models remain frozen.
  • Figure 5: Illustration of the CTC forced alignment algorithm used to compute $q(t,n)$ (Eq. \ref{eq:var}). $\varnothing$ is the blank class, and $(g_1, \dots, g_n)$ is the gloss sequence. The red lines denote the optimal path, which is obtained by backtracking from the final gloss that has the maximum probability (Eq. \ref{eq:final_prob}). Pseudocode is available in Alg. \ref{alg:segment}.
  • ...and 1 more figures
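The forced alignment mentioned in the Figure 5 caption can be sketched as a standard Viterbi-style search over the blank-extended gloss sequence. The code below is a generic textbook version under stated assumptions, not the paper's Alg. or its exact definition of $q(t,n)$: the function name `ctc_forced_align`, the `(T, C)` log-probability layout, and the choice of class index 0 for the blank are all illustrative.

```python
from typing import List

import numpy as np


def ctc_forced_align(log_probs: np.ndarray, glosses: List[int], blank: int = 0) -> List[int]:
    """Return the best CTC path (one class id per frame) for a known gloss sequence.

    log_probs: array of shape (T, C) holding frame-wise log-probabilities.
    glosses:   target gloss ids (g_1, ..., g_n), none equal to `blank`.
    """
    num_frames = log_probs.shape[0]
    # Blank-extended sequence: blank, g_1, blank, g_2, ..., g_n, blank.
    extended = [blank]
    for gloss in glosses:
        extended += [gloss, blank]
    num_states = len(extended)

    score = np.full((num_frames, num_states), -np.inf)
    back = np.zeros((num_frames, num_states), dtype=int)
    # A path may start in the leading blank or in the first gloss.
    score[0, 0] = log_probs[0, extended[0]]
    if num_states > 1:
        score[0, 1] = log_probs[0, extended[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            candidates = [s]  # stay in the same state
            if s >= 1:
                candidates.append(s - 1)  # advance by one state
            # Skipping a blank is allowed only between two different glosses.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                candidates.append(s - 2)
            best = max(candidates, key=lambda j: score[t - 1, j])
            score[t, s] = score[t - 1, best] + log_probs[t, extended[s]]
            back[t, s] = best

    # A valid path ends in the final gloss or in the trailing blank; take the better one.
    last = num_frames - 1
    if num_states == 1 or score[last, num_states - 1] >= score[last, num_states - 2]:
        s = num_states - 1
    else:
        s = num_states - 2
    if not np.isfinite(score[last, s]):
        raise ValueError("not enough frames to align the gloss sequence")

    # Backtrack from the chosen final state to recover the frame-level path.
    path = [blank] * num_frames
    for t in range(last, -1, -1):
        path[t] = extended[s]
        s = back[t, s]
    return path


if __name__ == "__main__":
    # Toy input: frames 0-1 favour gloss 1, frame 2 the blank, frames 3-4 gloss 2.
    favoured = [1, 1, 0, 2, 2]
    probs = np.full((5, 3), 0.05)
    probs[np.arange(5), favoured] = 0.9
    print(ctc_forced_align(np.log(probs), [1, 2]))
```

Contiguous runs of the same gloss in the returned path give approximate start and end frames for each sign, which is one way such an alignment could be used to cut isolated-sign clips out of continuous videos.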