Exploring Attention Mechanisms in Integration of Multi-Modal Information for Sign Language Recognition and Translation

Zaber Ibn Abdul Hakim, Rasman Mubtasim Swargo, Muhammad Abdullah Adnan

TL;DR

The paper tackles the challenge of effectively integrating multi-modal information for sign language recognition and translation while controlling computational cost. It introduces a lightweight, plug-in cross-modal attention module that fuses RGB and optical-flow features, and employs a two-stage training scheme to reduce overhead, enabling end-to-end continuous sign language recognition (CSLR) with a single feature extractor and end-to-end-free sign language translation (SLT). Evaluations on RWTH-PHOENIX-Weather 2014 and 2014T show a WER reduction of about 0.9 points for CSLR and a BLEU-4 gain of approximately 0.8 for SLT, with relative BLEU-4 improvements of up to ~3.5% over the baseline Stochastic Transformer. Ablation studies demonstrate the fusion’s superiority over simple merging methods and show that applying cross-modal attention after temporal reduction provides the strongest gains. The approach offers a practical path to improved sign-language understanding with lower computational demands and broad applicability to uni-modal backbones.

Abstract

Understanding intricate and fast-paced movements of body parts is essential for the recognition and translation of sign language. The inclusion of additional information intended to identify and locate the moving body parts has recently been an active research topic. However, previous works that use multi-modal information raise concerns such as sub-optimal methods for merging multi-modal features, or the model itself being too computationally heavy. In our work, we address these issues with a plugin module based on cross-attention that lets each modality attend to the other. Moreover, we use 2-stage training to remove the dependency on separate feature extractors for additional modalities in an end-to-end approach, which reduces the concern about computational complexity. In addition, our cross-attention plugin module is very lightweight and does not add significant computational overhead on top of the original baseline. We have evaluated the performance of our approaches on the RWTH-PHOENIX-2014 dataset for sign language recognition and the RWTH-PHOENIX-2014T dataset for the sign language translation task. Our approach reduced the WER by 0.9 on the recognition task and increased the BLEU-4 score by 0.8 on the translation task.
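
For readers unfamiliar with the recognition metric, the following is a minimal sketch of how word error rate (WER) is commonly computed: the edit distance between the reference and predicted gloss sequences, divided by the reference length. The function name and the example glosses are hypothetical and are not taken from the paper or its evaluation scripts.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j tokens of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical gloss sequences, for illustration only.
reference = ["MORGEN", "REGEN", "NORD", "WIND"]
hypothesis = ["MORGEN", "SONNE", "NORD"]
wer = 100 * word_error_rate(reference, hypothesis)
print(f"WER = {wer:.1f}%")  # one substitution + one deletion over 4 tokens -> 50.0%
```

WER is usually reported as a percentage, so a reduction of 0.9 means 0.9 percentage points of absolute improvement.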

Paper Structure

This paper contains 16 sections, 15 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Cross-modal attention on top of the visual alignment constraint with self-mutual distillation learning (SMKD \cite{cnn17}) is explained in Subfigure \ref{fig:cma_smkd}. The components above the dashed line were part of the original architecture, in which the reduced RGB features were sent to the BiLSTM layer. Components below the dashed line were added to incorporate optical-flow information into the RGB features. Subfigure \ref{fig:attn} illustrates the cross-attention module.
  • Figure 2: Cross-modal attention on top of stochastic transformer networks \cite{voskou2021stochastic}. Compared to the original pipeline, only the cross-modal attention module was added.
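
The cross-modal attention idea described in the captions above can be illustrated with a rough sketch. It assumes plain scaled dot-product attention in which each RGB frame feature queries the optical-flow features, followed by a residual sum. The learned projections, the exact fusion rule, and the names `cross_attention` and `fuse` are all assumptions made for illustration; the paper's actual module may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention without learned projections (assumption).

    Each query vector (one modality) attends over all vectors of the
    other modality, which serve as both keys and values.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * v[i] for w, v in zip(weights, keys_values))
                         for i in range(d)])
    return attended


def fuse(rgb, flow):
    """Residual fusion (assumption): RGB feature plus attended flow context."""
    attended = cross_attention(rgb, flow)
    return [[r + a for r, a in zip(r_t, a_t)]
            for r_t, a_t in zip(rgb, attended)]


# Toy features: 3 RGB time steps and 2 optical-flow time steps, dimension 2.
rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
flow = [[0.5, 0.5], [1.0, 0.0]]
fused = fuse(rgb, flow)
print(len(fused), len(fused[0]))  # one fused vector per RGB time step
```

Because the output keeps the shape of the RGB sequence, a module of this kind can sit in front of an existing uni-modal sequence model without changing its input interface, which is consistent with the plug-in framing in the captions.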