EvSign: Sign Language Recognition and Translation with Streaming Events

Pengyu Zhang; Hao Yin; Zeren Wang; Wenyue Chen; Shengming Li; Dong Wang; Huchuan Lu; Xu Jia

TL;DR

This work addresses sign language recognition and translation under challenging conditions, where RGB video suffers from motion blur and raises privacy concerns. It introduces EvSign, the first event-based benchmark for continuous sign language recognition (CSLR) and sign language translation (SLT), with both gloss and spoken-language annotations. It also proposes a transformer-based framework that processes sparse streaming events with a sparse backbone, local token fusion, and gloss-aware temporal aggregation to capture local and global temporal cues. Experiments on simulated PHOENIX14T and real EvSign data show competitive performance at only 0.34% of the computational cost (FLOPs) and 44.2% of the network parameters of existing state-of-the-art approaches, outperforming RGB methods in several settings. The results point to event-based sensing as a promising direction for robust, privacy-conscious CSLR and SLT systems in real-world scenarios.

Abstract

Sign language is one of the most effective communication tools for people with hearing difficulties. Most existing works focus on improving the performance of sign language tasks on RGB videos, which may suffer from degraded recording conditions, such as motion blur caused by fast hand movement and the textured appearance of the signer. The bio-inspired event camera, which asynchronously captures brightness changes at high speed, naturally perceives dynamic hand movements and provides rich manual cues for sign language tasks. In this work, we explore the potential of event cameras in continuous sign language recognition (CSLR) and sign language translation (SLT). To promote this research, we first collect EvSign, an event-based benchmark for these tasks with both gloss and spoken language annotations. The EvSign dataset offers a substantial amount of high-quality event streams and an extensive vocabulary of glosses and words, thereby facilitating the development of sign language tasks. In addition, we propose an efficient transformer-based framework for event-based SLR and SLT tasks, which fully leverages the advantages of streaming events. A sparse backbone is employed to extract visual features from sparse events. Temporal coherence is then exploited through the proposed local token fusion and gloss-aware temporal aggregation modules. Extensive experimental results are reported on both simulated (PHOENIX14T) and EvSign datasets. Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost (0.84G FLOPs per video) and 44.2% of the network parameters. The project is available at https://zhang-pengyu.github.io/EVSign.
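
To make the idea of "sparse events" concrete, the following is a minimal sketch of how an asynchronous event stream could be accumulated into sparse frames in which only pixels that fired are stored. It is an illustration under stated assumptions, not the representation used in the paper: the `(x, y, t, polarity)` event layout, the helper names `bin_events` and `active_fraction`, and the fixed-width temporal binning are all hypothetical choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed event layout: (x, y, timestamp in seconds, polarity in {-1, +1}).
Event = Tuple[int, int, float, int]
SparseFrame = Dict[Tuple[int, int], int]


def bin_events(events: List[Event], t_start: float, t_end: float,
               num_bins: int) -> List[SparseFrame]:
    """Accumulate events into num_bins sparse frames.

    Each frame maps a pixel (x, y) that received at least one event to its
    summed polarity; pixels that received no events are simply absent.
    """
    if num_bins <= 0 or t_end <= t_start:
        raise ValueError("need num_bins > 0 and t_end > t_start")
    frames = [defaultdict(int) for _ in range(num_bins)]
    bin_width = (t_end - t_start) / num_bins
    for x, y, t, p in events:
        if not (t_start <= t < t_end):
            continue  # ignore events outside the time window
        idx = min(int((t - t_start) / bin_width), num_bins - 1)
        frames[idx][(x, y)] += p
    return [dict(f) for f in frames]


def active_fraction(frame: SparseFrame, width: int, height: int) -> float:
    """Fraction of the sensor's pixels that appear in a sparse frame."""
    return len(frame) / (width * height)


# Toy stream: three events near one location, one event elsewhere later on.
events = [(10, 20, 0.001, 1), (10, 20, 0.002, 1),
          (11, 20, 0.004, -1), (50, 60, 0.012, 1)]
frames = bin_events(events, t_start=0.0, t_end=0.02, num_bins=2)
```

In this toy stream, the first frame holds two active pixels out of a hypothetical 100 x 100 sensor, which is the kind of sparsity a sparse backbone can exploit by skipping inactive locations.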

Paper Structure (19 sections, 9 equations, 3 figures, 7 tables)

Introduction
Related work
RGB-based sign language recognition
RGB-based sign language translation
Event-based sign language recognition
EvSign benchmark
Benchmark Statistics
Annotation
Evaluation Metrics
Methodology
Overview
Overall framework
Experiments
Datasets and evaluation protocol
Training Details
...and 4 more sections

Figures (3)

Figure 1: Comparison between sign language recognition and translation with RGB and event data. We provide the first benchmark for event-based CSLR and SLT tasks, namely EvSign. Compared with RGB data, the event stream captures smooth movement with microsecond-level response, avoiding motion blur. Furthermore, the sparse events emphasize only the moving targets, such as hands and arms, so they can be processed efficiently and help protect personal privacy (facial information).
Figure 2: Pipeline of the transformer-based framework for CSLR and SLT tasks.
Figure 3: Visualization of the gloss-aware mask on EvSign dataset.
