First-place Solution for Streetscape Shop Sign Recognition Competition
Bin Wang, Li Jing
TL;DR
Addresses storefront shop sign recognition in street-view imagery, focusing on detecting signboards and reading store names under challenging urban conditions. Proposes a four-stage multimodal pipeline: (1) signboard detection with an enhanced Mask-RCNN; (2) signboard text detection and key information extraction (KIE) with a two-stage detector and a Graph Neural Network; (3) Transformer-based text recognition with strong self-supervised pretraining; and (4) reading sequence prediction. The pipeline is augmented by BoxDQN and perspective rectification. Combining self-supervised pretraining with multimodal fusion, the approach finished in first place with an F-score of 0.6672, computed as $F = \frac{2 \cdot \text{accuracy} \cdot \text{recall}}{\text{accuracy} + \text{recall}}$. The study demonstrates practical potential for end-to-end street-view text understanding in maps, navigation, and smart-city analytics, and highlights trends toward large multimodal models and semi-supervised learning.
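As a small illustration of the F-score formula above, the following Python sketch computes the harmonic mean of the two quantities. The function name `f_score` and the input values are hypothetical and chosen only for illustration; they are not the competition's evaluation code or its reported accuracy and recall.

```python
def f_score(accuracy: float, recall: float) -> float:
    """Harmonic mean of accuracy and recall, following the formula above."""
    # Guard against division by zero when both inputs are zero.
    if accuracy + recall == 0:
        return 0.0
    return 2 * accuracy * recall / (accuracy + recall)


# Hypothetical inputs for illustration only (not competition results).
print(round(f_score(0.70, 0.60), 4))  # prints 0.6462
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a system cannot reach a high F-score by improving only one of the two quantities.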
Abstract
Text recognition for street-view storefront signs is increasingly used in practical domains such as map navigation, smart-city planning analysis, and business value assessment in commercial districts, and it holds significant research and commercial potential. It also faces numerous challenges: street-view images often contain signboards with complex designs and diverse text styles, which complicates recognition. For a recent competition, our team developed a multistage approach that integrates multimodal feature fusion, extensive self-supervised training, and a Transformer-based large model. We also employed BoxDQN, a technique based on reinforcement learning, together with text rectification methods. Comprehensive experiments validate the effectiveness of these methods for text recognition in complex urban environments.
