Event Stream-based Sign Language Translation: A High-Definition Benchmark Dataset and A Novel Baseline

Shiao Wang, Xiao Wang, Duoqing Yang, Yao Rong, Fuling Wang, Jianing Li, Lin Zhu, Bo Jiang

TL;DR

The paper presents Event-CSL, a large high-definition dataset captured with an event camera to advance privacy-preserving sign language translation (SLT) under challenging lighting conditions. It introduces EvSLT, a single-stage framework that combines Mamba-based memory aggregation for spatial detail with a graph-guided spatiotemporal fusion module that integrates spatial and temporal cues, followed by a gloss-free SLT decoder based on mBART. In experiments on Event-CSL and existing event-based datasets, the approach achieves state-of-the-art results and improved efficiency, and ablations and visualizations support the effectiveness of both the spatial and the temporal modeling. The work provides a dataset and baselines to accelerate progress in event-based SLT and privacy-conscious gesture-to-text translation.

Abstract

Sign Language Translation (SLT) is a core task in AI-based assistance for people with disabilities. Traditional SLT methods are typically based on visible-light videos, which are easily degraded by lighting variations and rapid hand movements and which raise privacy concerns. This paper proposes the use of bio-inspired event cameras to alleviate these issues. Specifically, we introduce a new high-definition event-based sign language dataset, termed Event-CSL, which addresses the data scarcity in this research area. The dataset comprises 14,827 videos, 14,821 glosses, and 2,544 Chinese words in the text vocabulary. The samples were collected across diverse indoor and outdoor scenes, covering multiple viewpoints, lighting conditions, and camera motions. We also benchmark existing mainstream SLT methods on this dataset to facilitate fair comparisons in future research. Furthermore, we propose a novel event-based sign language translation framework, termed EvSLT. The framework first segments continuous video features into clips and employs a Mamba-based memory aggregation module to compress and aggregate spatial detail features at the clip level. These spatial features, along with temporal representations obtained from temporal convolution, are then fused by a graph-guided spatiotemporal fusion module. Extensive experiments on Event-CSL, as well as other publicly available datasets, demonstrate the superior performance of our method. The dataset and source code will be released at https://github.com/Event-AHU/OpenESL
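
The first two steps named in the abstract (splitting frame features into clips, then compressing each clip) and the temporal-convolution branch can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: mean pooling stands in for the learned Mamba-based memory aggregation, the convolution kernel is a fixed moving average rather than learned weights, and the function names, the clip length, and the feature sizes are hypothetical choices made for illustration.

```python
import numpy as np


def segment_into_clips(features: np.ndarray, clip_len: int) -> list:
    """Split (T, D) frame features into consecutive clips of up to clip_len frames.

    The last clip is shorter when T is not a multiple of clip_len.
    """
    return [features[i:i + clip_len] for i in range(0, len(features), clip_len)]


def aggregate_clips(clips: list) -> np.ndarray:
    """Compress each clip to a single D-dimensional vector.

    Mean pooling is a placeholder here; the paper uses a learned
    Mamba-based memory aggregation module instead.
    """
    return np.stack([clip.mean(axis=0) for clip in clips])


def temporal_conv(features: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Apply the same 1D kernel along time to every feature channel.

    Uses 'same' mode, so the output keeps T steps as long as
    T >= len(kernel).
    """
    channels = [
        np.convolve(features[:, d], kernel, mode="same")
        for d in range(features.shape[1])
    ]
    return np.stack(channels, axis=1)


# Hypothetical sizes: 10 frames with 4-dimensional features, clips of 4 frames.
frame_features = np.arange(40, dtype=float).reshape(10, 4)
clips = segment_into_clips(frame_features, clip_len=4)
clip_memory = aggregate_clips(clips)          # one vector per clip
temporal_features = temporal_conv(frame_features, np.ones(3) / 3.0)
```

With these sizes the ten frames fall into three clips (4, 4, and 2 frames), so `clip_memory` has one row per clip while `temporal_features` keeps one row per frame.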

Paper Structure (22 sections, 13 equations, 7 figures, 8 tables)


Figures (7)

  • Figure 1: Comparison between existing event-based sign language datasets and our newly proposed Event-CSL benchmark dataset. The bubble size represents the resolution.
  • Figure 2: An overview of the proposed event-based sign language translation framework. The framework segments continuous video features into clip features and employs a Mamba-based memory aggregation module to compress and aggregate spatial details. The resulting spatial and compressed clip features, together with temporal representations from temporal convolution, are fused in a graph-guided spatiotemporal fusion module. Hypergraphs built from compressed clip features facilitate interaction between memory-enhanced spatial and temporal representations. Finally, a Transformer-based language model generates the corresponding Chinese sentences.
  • Figure 3: Representative samples from our proposed Event-CSL dataset and the corresponding text annotations.
  • Figure 4: Sample video frames from the simulated PHOENIX-2014T and CSL-Daily datasets.
  • Figure 5: The word cloud visualization for the textual annotations in our proposed Event-CSL dataset.
  • ...and 2 more figures
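
The Figure 2 caption mentions hypergraphs built from compressed clip features. As a rough illustration of how a hypergraph can pass information between nodes, the sketch below builds a k-nearest-neighbour hypergraph and applies one parameter-free propagation step (node features averaged into hyperedges, then hyperedge features averaged back to nodes). This is a generic construction, not the paper's fusion module: the k-NN rule, the normalisation, and the names `knn_hypergraph` and `hypergraph_propagate` are assumptions made for illustration.

```python
import numpy as np


def knn_hypergraph(nodes: np.ndarray, k: int) -> np.ndarray:
    """Build an (N, N) incidence matrix with one hyperedge per node.

    Hyperedge e contains node e and its k nearest neighbours under
    Euclidean distance. H[v, e] = 1 means node v belongs to hyperedge e.
    """
    n = nodes.shape[0]
    dists = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=-1)
    incidence = np.zeros((n, n))
    for e in range(n):
        neighbours = np.argsort(dists[e])[:k + 1]
        incidence[neighbours, e] = 1.0
        incidence[e, e] = 1.0  # guarantee the centre node is a member
    return incidence


def hypergraph_propagate(x: np.ndarray, incidence: np.ndarray) -> np.ndarray:
    """One propagation step: nodes -> hyperedges -> nodes, using means."""
    edge_degree = incidence.sum(axis=0)   # number of nodes in each hyperedge
    node_degree = incidence.sum(axis=1)   # number of hyperedges per node
    edge_features = (incidence.T @ x) / edge_degree[:, None]
    return (incidence @ edge_features) / node_degree[:, None]


# Hypothetical input: 6 compressed clip features of dimension 8.
rng = np.random.default_rng(0)
clip_features = rng.normal(size=(6, 8))
H = knn_hypergraph(clip_features, k=2)
mixed = hypergraph_propagate(clip_features, H)
```

Because both steps are plain averages, a feature that is constant across all nodes is left unchanged; a learned variant would add weight matrices and a nonlinearity around the same incidence-based mixing.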