Sign Language Translation using Frame and Event Stream: Benchmark Dataset and Algorithms

Xiao Wang, Yuehang Li, Fuling Wang, Bo Jiang, Yaowei Wang, Yonghong Tian, Jin Tang, Bin Luo

TL;DR

This work tackles sign language translation under real-world conditions by fusing RGB frames with event streams to overcome lighting, motion blur, and frame-rate limitations. It introduces VECSL, a large-scale RGB-Event SLT dataset collected with a DVS346 camera, and a retrieval-guided framework (M$^2$-SLT) that combines Micro-Sign Retrieval (MiR) and Macro-Sign Retrieval (MaR) with a Hopfield memory and an mBART decoder to achieve state-of-the-art performance. The key contributions are the VECSL dataset, the MiR/MaR retrieval-based architecture, and a comprehensive benchmark showing substantial gains over prior RGB-only or event-only methods, enabling robust multi-modal sign language understanding in diverse environments. Together, these advances provide a solid foundation for real-time, privacy-conscious SLT in unconstrained settings and open avenues for further improvements via larger decoders and semi-supervised learning.
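To make the "Hopfield memory" mentioned above more concrete, the following is a minimal sketch of a single retrieval step in a generic modern (continuous) Hopfield network: the query is compared with every stored pattern, the similarities are passed through a softmax, and the result is the weighted sum of the stored patterns. This is a textbook formulation, not the authors' implementation; the function name `hopfield_retrieve`, the inverse temperature `beta`, and the toy three-pattern memory are all assumptions made for illustration.

```python
import math


def hopfield_retrieve(query, memory, beta=4.0):
    """One update step of a generic modern Hopfield network (illustrative only).

    query  : list of floats, the (possibly noisy) probe vector
    memory : list of stored patterns, each the same length as query
    beta   : assumed inverse temperature; larger values sharpen retrieval
    """
    # Scaled dot-product similarity between the query and each stored pattern.
    scores = [beta * sum(q * m for q, m in zip(query, pattern)) for pattern in memory]

    # Numerically stable softmax over the similarities.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Retrieved vector: softmax-weighted sum of the stored patterns.
    dim = len(query)
    retrieved = [
        sum(w * pattern[i] for w, pattern in zip(weights, memory))
        for i in range(dim)
    ]
    return retrieved, weights


# Hypothetical toy memory with three orthogonal patterns.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy_query = [0.9, 0.2, 0.1]
retrieved, weights = hopfield_retrieve(noisy_query, memory)
```

In this toy setting the noisy query is pulled toward the first stored pattern, because that pattern receives the largest softmax weight. How M$^2$-SLT populates and queries its memory is described in the paper itself and is not reproduced here.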

Abstract

Accurate sign language understanding serves as a crucial communication channel for individuals with disabilities. Current sign language translation algorithms predominantly rely on RGB frames, which may be limited by fixed frame rates, variable lighting conditions, and motion blur caused by rapid hand movements. Inspired by the recent successful application of event cameras in other fields, we propose to leverage event streams to assist RGB cameras in capturing gesture data, addressing the various challenges mentioned above. Specifically, we first collect a large-scale RGB-Event sign language translation dataset using the DVS346 camera, termed VECSL, which contains 15,676 RGB-Event samples, 15,191 glosses, and covers 2,568 Chinese characters. These samples were gathered across a diverse range of indoor and outdoor environments, capturing multiple viewing angles, varying light intensities, and different camera motions. Due to the absence of benchmark algorithms for comparison in this new task, we retrained and evaluated multiple state-of-the-art SLT algorithms, and believe that this benchmark can effectively support subsequent related research. Additionally, we propose a novel RGB-Event sign language translation framework (i.e., M$^2$-SLT) that incorporates fine-grained micro-sign and coarse-grained macro-sign retrieval, achieving state-of-the-art results on the proposed dataset. Both the source code and dataset will be released at https://github.com/Event-AHU/OpenESL.

Paper Structure

This paper contains 21 sections, 8 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: The imaging principle of event cameras and an overview schematic of our proposed M$^2$-SLT.
  • Figure 2: An overview of our proposed framework for sign language translation, i.e., M$^2$-SLT. It consists of four main components, including RGB and event encoding, micro-sign retrieval (MiR), macro-sign retrieval (MaR), and sign language decoding.
  • Figure 3: An illustration of informative video frames selection in RGB-Event Macro-Sign Retrieval (MaR).
  • Figure 4: A visualization of the RGB frames and Event streams for sign language translation.
  • Figure 5: Representative samples of our VECSL dataset.
  • ...and 2 more figures