American Sign Language Video to Text Translation

Parsheeta Roy, Ji-Eun Han, Srishti Chouhan, Bhaavanaa Thumu

TL;DR

The paper tackles ASL video-to-text translation by replicating a recent SLT baseline on How2Sign and examining the impact of optimizers, activation functions, and label smoothing through ablations. It leverages I3D-based visual features within a Transformer encoder–decoder framework and evaluates using BLEU and rBLEU to mitigate metric bias. Key findings show that strong regularization (e.g., weight decay, dropout, label smoothing) and deeper decoder configurations improve translation quality, with rBLEU correlating with semantic capture in video content. The work provides a replicable baseline, detailed preprocessing/training procedures, and points to future improvements in visual feature extraction and decoder utilization, potentially aided by pre-trained decoders.

Abstract

Sign language to text translation is a crucial technology that can break down communication barriers for individuals with hearing difficulties. We replicate, and attempt to improve on, a recently published sign language translation study. We evaluate models using the BLEU and rBLEU metrics to assess translation quality. In our ablation study, we found that the model's performance is significantly influenced by the choice of optimizer, activation function, and label smoothing. Future work aims to refine visual feature extraction, make better use of the decoder, and integrate pre-trained decoders for better translation outcomes. Our source code is available to facilitate replication of our results and encourage future research.

Paper Structure (22 sections, 3 figures, 7 tables)

Figures (3)

  • Figure 1: The building blocks of the Transformer model [slt-how2sign-wicv2023]
  • Figure 2: Formula for BLEU score
  • Figure 3: Cross Entropy Formula with Label Smoothing
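
Figure 3 refers to cross-entropy with label smoothing. As a rough illustration only, the sketch below implements one common formulation in plain Python. It assumes the smoothed target places 1 - epsilon on the reference token and spreads epsilon uniformly over all K classes (including the reference one), which may differ in detail from the formula shown in the paper. The function names and the default `epsilon=0.1` are hypothetical and not taken from the paper's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy against a smoothed target distribution.

    Assumed formulation: the target distribution is
    (1 - epsilon) * one_hot(target) + epsilon / K for each of the K classes.
    """
    k = len(logits)
    log_probs = log_softmax(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Smoothed probability mass assigned to class i.
        q = epsilon / k + ((1.0 - epsilon) if i == target else 0.0)
        loss -= q * lp
    return loss


if __name__ == "__main__":
    logits = [4.0, 1.0, 0.5, 0.2]  # toy scores over a 4-token vocabulary
    print(label_smoothed_cross_entropy(logits, target=0, epsilon=0.0))
    print(label_smoothed_cross_entropy(logits, target=0, epsilon=0.1))
```

With `epsilon=0.0` the function reduces to ordinary cross-entropy. A positive `epsilon` raises the loss for very confident predictions, which is the regularizing effect label smoothing is meant to provide.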