First-place Solution for Streetscape Shop Sign Recognition Competition

Bin Wang, Li Jing

TL;DR

Addresses storefront shop sign recognition in street-view imagery, focusing on detecting signboards and reading store names under challenging urban conditions. Proposes a four-stage multimodal pipeline: signboard detection via an enhanced Mask-RCNN, signboard text detection and KIE with a two-stage detector and Graph Neural Network, Transformer-based text recognition with strong self-supervised pretraining, and reading sequence prediction, augmented by BoxDQN and perspective rectification. The approach leverages self-supervised pretraining and multimodal fusion to achieve robust performance, culminating in a first-place finish with an F-score of 0.6672, computed as $F = \frac{2 \cdot \text{accuracy} \cdot \text{recall}}{\text{accuracy} + \text{recall}}$. The study demonstrates practical potential for end-to-end street-view text understanding in maps, navigation, and smart-city analytics, highlighting trends toward large multimodal models and semi-supervised learning.
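The F-score in the TL;DR is the harmonic mean of accuracy and recall. As a minimal sketch of that arithmetic, the Python snippet below evaluates the formula; the function name `f_score`, the zero-denominator guard, and the sample inputs are illustrative assumptions and do not come from the competition's evaluation code.

```python
def f_score(accuracy: float, recall: float) -> float:
    """Harmonic mean of accuracy and recall, following the TL;DR formula.

    Returning 0.0 when both inputs are zero is an assumption made here
    to avoid dividing by zero; the source does not specify this case.
    """
    denominator = accuracy + recall
    if denominator == 0:
        return 0.0
    return 2 * accuracy * recall / denominator


if __name__ == "__main__":
    # Hypothetical inputs for illustration only, not the reported results.
    print(f_score(0.5, 0.5))
    print(f_score(0.8, 0.6))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a system cannot reach a high F-score by maximizing only one of accuracy or recall.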

Abstract

Text recognition for street-view storefront signs is increasingly used in practical domains such as map navigation, smart-city planning analysis, and business value assessment in commercial districts, and it holds significant research and commercial potential. It also faces numerous challenges: street-view images often contain signboards with complex designs and diverse text styles, which complicates recognition. Our team addressed these challenges in a recent competition with a multistage approach that integrates multimodal feature fusion, extensive self-supervised training, and a Transformer-based large model. We also employed BoxDQN, a technique based on reinforcement learning, together with text rectification methods. Comprehensive experiments validate the effectiveness of these methods for text recognition in complex urban environments.

Paper Structure (15 sections, 8 figures, 1 table)

This paper contains 15 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Examples of store sign and store name data.
  • Figure 2: Examples of signboard detection data.
  • Figure 3: Examples of signboard OCR data.
  • Figure 4: The algorithm pipeline. KIE represents key information extraction.
  • Figure 5: The end-to-end network for joint text detection and KIE tasks. The model consists of two major parts. One is a two-stage detector that detects the position of the text and extracts the positional embedding and image embedding of the text. The other is a graph neural network for integrating multimodal features and distinguishing which text belongs to the store signboard.
  • ...and 3 more figures