EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition

Xiao Wang, Jingtao Jiang, Dong Li, Futian Wang, Lin Zhu, Yaowei Wang, Yongyong Tian, Jin Tang

TL;DR

This work introduces EventSTR, a large-scale, high-definition event-camera benchmark for scene text recognition and proposes SimC-ESTR, a vision-language framework that stacks event streams into frames, uses a Q-former to align visual tokens to a pre-trained LLM, and enhances features with a memory mechanism and a glyph-based error correction module. The dataset comprises 9,928 event sequences at 1280×720 with both Chinese and English characters, enabling robust evaluation under low light, motion, and occlusion. Experiments show SimC-ESTR achieves state-of-the-art BLEU scores on EventSTR and demonstrates the value of memory augmentation and glyph correction, while highlighting limitations related to computational demands and VQA-pretraining bias. The work provides extensive baselines and release plans for code and models, aiming to accelerate research in event-camera–based text recognition and related applications.

Abstract

Mainstream Scene Text Recognition (STR) algorithms are developed for RGB cameras, which are sensitive to challenging factors such as low illumination, motion blur, and cluttered backgrounds. In this paper, we propose to recognize scene text using bio-inspired event cameras, and we collect and annotate a large-scale benchmark dataset, termed EventSTR. It contains 9,928 high-definition (1280×720) event samples and covers both Chinese and English characters. We also benchmark multiple STR algorithms as baselines for future work. In addition, we propose a new event-based scene text recognition framework, termed SimC-ESTR. It first extracts event features using a visual encoder and projects them into tokens using a Q-former module. More importantly, we augment the vision tokens with a memory mechanism before feeding them into the large language model. A similarity-based error correction mechanism is embedded within the large language model to correct potential minor errors based on contextual information. Extensive experiments on the newly proposed EventSTR dataset and two simulation STR datasets demonstrate the effectiveness of our proposed model. We believe that the dataset and model introduce the event-based STR task and are expected to accelerate the application of event cameras in various industries. The source code and pre-trained models will be released on https://github.com/Event-AHU/EventSTR

Paper Structure

This paper contains 23 sections, 2 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Examples illustrating the motivation behind EventSTR. (a) Challenges of scene text recognition under low-light conditions where RGB cameras struggle to capture clear text. (b) Motion blur scenarios that degrade text readability in RGB images. (c) Occlusion issues that hinder text recognition in complex environments. In contrast, (d) shows event camera data that effectively addresses low-light and motion blur challenges due to its high temporal resolution and dynamic range. Additionally, occlusion issues can be mitigated through the reasoning capabilities of LLMs, enabling more robust text recognition in challenging scenarios.
  • Figure 2: An overview of our proposed large language model based event stream scene text recognition framework, termed SimC-ESTR. Given the event streams, we first stack them into a single event frame and use a visual encoder to extract feature representations. These features are passed through a Q-former module to align vision tokens with a pre-trained large language model (LLM), which then generates text. To further enhance the features, we introduce a memory mechanism that leverages contextual samples for better representation. We also address the issue of LLMs occasionally producing incorrect but visually similar Chinese characters by designing a correction module specifically for such cases. More details of these modules will be described in Section \ref{sec::network}.
  • Figure 3: Illustration of some representative samples of our proposed EventSTR dataset. The left side displays the event stream, while the right side shows the corresponding first frame image.
  • Figure 4: Statistical analysis for the EventSTR dataset. (a) The number of images with different text lengths. (b) Distribution of the number of characters.
  • Figure 5: The word cloud visually represents the frequency distribution of words in the EventSTR dataset labels. Words that appear more frequently are displayed larger and more prominently, whereas smaller words correspond to those with lower occurrence rates.
  • ...and 2 more figures
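
The Figure 2 caption says the event stream is first stacked into a single event frame before visual encoding. The exact stacking procedure is not given in this summary, so the following is only a minimal sketch of one common approach: counting events per pixel in two polarity channels. The function name `stack_events`, the `(x, y, t, p)` column layout, and the two-channel output are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def stack_events(events, height=720, width=1280):
    """Accumulate an event stream into a 2-channel count frame.

    Assumed (hypothetical) layout: `events` has shape (N, 4) with columns
    x, y, t, p. Polarity p > 0 is treated as positive, anything else as
    negative. Timestamps are ignored: the whole stream becomes one frame.
    """
    events = np.asarray(events, dtype=np.float64)
    frame = np.zeros((2, height, width), dtype=np.float32)

    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    p = (events[:, 3] > 0).astype(np.int64)  # channel 1 = positive polarity

    # Drop events whose coordinates fall outside the sensor area.
    valid = (x >= 0) & (x < width) & (y >= 0) & (y < height)

    # np.add.at accumulates correctly when the same pixel fires repeatedly;
    # plain fancy-index assignment would count such a pixel only once.
    np.add.at(frame, (p[valid], y[valid], x[valid]), 1.0)
    return frame

# Tiny usage example on a 4x8 sensor (the last event is out of bounds).
demo = [[0, 0, 0.0, 1], [0, 0, 0.1, 1], [3, 2, 0.2, 0], [5000, 1, 0.3, 1]]
demo_frame = stack_events(demo, height=4, width=8)
```

The defaults match the 1280×720 resolution reported for EventSTR. A real pipeline might also normalize the counts or split the stream into several time bins; neither choice is specified in the text above.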