Event Stream-based Sign Language Translation: A High-Definition Benchmark Dataset and A Novel Baseline
Shiao Wang, Xiao Wang, Duoqing Yang, Yao Rong, Fuling Wang, Jianing Li, Lin Zhu, Bo Jiang
TL;DR
The paper presents Event-CSL, a large high-definition dataset captured with an event camera to advance sign language translation (SLT) under challenging lighting conditions while preserving privacy. It introduces EvSLT, a single-stage framework that combines Mamba-based memory aggregation for spatial detail with a graph-guided spatiotemporal fusion module that integrates spatial and temporal cues, followed by a gloss-free SLT decoder based on mBART. In experiments on Event-CSL and existing event-based datasets, the approach achieves state-of-the-art results and improved efficiency, with ablations and visualizations supporting the effectiveness of both the spatial and the temporal modeling. The work provides a dataset and baselines for event-based SLT and privacy-conscious sign-to-text translation.
Abstract
Sign Language Translation (SLT) is a core task in AI-based assistance for people with disabilities. Traditional SLT methods are typically based on visible-light videos, which are easily degraded by lighting variations and rapid hand movements and which also raise privacy concerns. This paper proposes the use of bio-inspired event cameras to alleviate these issues. Specifically, we introduce a new high-definition event-based sign language dataset, termed Event-CSL, which addresses the data scarcity in this research area. The dataset comprises 14,827 videos, 14,821 glosses, and 2,544 Chinese words in the text vocabulary. These samples are collected across diverse indoor and outdoor scenes, covering multiple viewpoints, lighting conditions, and camera motions. We have also benchmarked existing mainstream SLT methods on this dataset to facilitate fair comparisons in future research. Furthermore, we propose a novel event-based sign language translation framework, termed EvSLT. The framework first segments continuous video features into clips and employs a Mamba-based memory aggregation module to compress and aggregate spatial detail features at the clip level. These spatial features, along with temporal representations obtained from temporal convolution, are then fused by a graph-guided spatiotemporal fusion module. Extensive experiments on Event-CSL, as well as other publicly available datasets, demonstrate the superior performance of our method. The dataset and source code will be released at https://github.com/Event-AHU/OpenESL
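
The data flow described in the abstract (segment frame features into clips, aggregate each clip, compute temporal features, fuse the two) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. Every name here (`split_into_clips`, `mean_pool`, `temporal_conv`, `fuse`) is hypothetical, mean pooling stands in for the Mamba-based memory aggregation, concatenation stands in for the graph-guided fusion, and applying the temporal convolution to clip-level features is an assumption made only so that the two branches have the same length.

```python
from typing import List

Vector = List[float]


def split_into_clips(frames: List[Vector], clip_len: int) -> List[List[Vector]]:
    """Split per-frame feature vectors into consecutive clips.

    The last clip may be shorter than clip_len.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def mean_pool(clip: List[Vector]) -> Vector:
    """Average the frame vectors of one clip.

    Placeholder for the Mamba-based memory aggregation (assumption).
    """
    dim = len(clip[0])
    return [sum(frame[d] for frame in clip) / len(clip) for d in range(dim)]


def temporal_conv(seq: List[Vector], kernel: List[float]) -> List[Vector]:
    """1D convolution along time with zero padding and same-length output.

    The same kernel is applied to every feature channel.
    """
    pad = len(kernel) // 2
    dim = len(seq[0])
    out = []
    for t in range(len(seq)):
        vec = [0.0] * dim
        for k, weight in enumerate(kernel):
            src = t + k - pad
            if 0 <= src < len(seq):  # positions outside the sequence count as zero
                for d in range(dim):
                    vec[d] += weight * seq[src][d]
        out.append(vec)
    return out


def fuse(spatial: List[Vector], temporal: List[Vector]) -> List[Vector]:
    """Concatenate spatial and temporal vectors per clip.

    Placeholder for the graph-guided spatiotemporal fusion (assumption).
    """
    if len(spatial) != len(temporal):
        raise ValueError("spatial and temporal sequences must have equal length")
    return [s + t for s, t in zip(spatial, temporal)]


# Toy input: five frames with two feature channels each.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
clips = split_into_clips(frames, clip_len=2)
spatial = [mean_pool(clip) for clip in clips]
temporal = temporal_conv(spatial, kernel=[0.25, 0.5, 0.25])
fused = fuse(spatial, temporal)
```

With this toy input, five frames become three clip-level vectors, and each fused vector has four channels (two spatial, two temporal). In the framework described above, the fused sequence would then be passed to the translation decoder.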
