IndicSTR12: A Dataset for Indic Scene Text Recognition

Harsh Lunia, Ajoy Mondal, C V Jawahar

TL;DR

This paper addresses the scarcity of large-scale Indian-language data for scene text recognition by introducing IndicSTR12, the largest real-world dataset across 12 major Indian languages, along with a synthetic corpus for 13 languages. It benchmarks three state-of-the-art STR models (PARSeq, CRNN, and STARNet) on both real and synthetic data, showing that transformer-based PARSeq generally yields superior performance, especially with sufficient real data, while highlighting the benefits of multilingual training. The work details the curation and annotation process (4-corner polygons, quality reviews) and analyzes failure modes (low resolution, irregular text, matras), underscoring the dataset's challenge and value. Overall, IndicSTR12 advances Indian STR research by providing a comprehensive resource that enables robust multi-language recognition and motivates further data-driven improvements.

Abstract

The importance of Scene Text Recognition (STR) in today's increasingly digital world cannot be overstated. Given the significance of STR, data-intensive deep learning approaches that auto-learn feature mappings have primarily driven the development of STR solutions. Several benchmark datasets and substantial work on deep learning models are available for Latin languages to meet this need. For Indian languages, which are syntactically and semantically more complex and are spoken and read by 1.3 billion people, less work and fewer datasets are available. This paper aims to address the lack of a comprehensive dataset in the Indian space by proposing the largest and most comprehensive real dataset, IndicSTR12, and benchmarking STR performance on 12 major Indian languages. A few works have addressed the same issue, but to the best of our knowledge, they focused on a small number of Indian languages. The size and complexity of the proposed dataset are comparable to those of existing Latin contemporaries, while its multilingualism will catalyse the development of robust text detection and recognition models. It was created specifically for a group of related languages with different scripts. The dataset contains over 27,000 word-images gathered from various natural scenes, with over 1,000 word-images for each language. Unlike previous datasets, the images cover a broader range of realistic conditions, including blur, illumination changes, occlusion, non-iconic texts, low resolution, and perspective text. Along with the new dataset, we provide a high-performing baseline on three models: PARSeq, CRNN, and STARNet.

Paper Structure (24 sections, 4 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Samples from IndicSTR12 Dataset: Real word-images (left); Synthetic word-images (right)
  • Figure 2: IndicSTR12 Dataset: Font Variations for the same word - Gujarati or Gujarat
  • Figure 3: IndicSTR12 Dataset Variations, clockwise from Top-Left: Illumination variation, Low Resolution, Multi-Oriented - Irregular Text, Variation in Text Length, Perspective Text, and Occluded.
  • Figure 4: PARSeq architecture. [B] and [P] are the beginning-of-sequence and padding tokens. T=30, i.e., 30 distinct position tokens. $L_{CE}$ corresponds to cross entropy loss.
  • Figure 5: STARNet model (left) and CRNN model (right)
  • ...and 2 more figures