Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

TL;DR

The paper introduces the Bharat Scene Text Dataset (BSTD), a large-scale, open dataset for Indian-language scene text understanding, covering 11 languages plus English and four tasks: detection, script identification, cropped word recognition, and end-to-end recognition. It details a rigorous three-stage dataset curation, meticulous annotation, and a comprehensive baseline pipeline (IndicPhotoOCR) that combines TextBPN++ detection, ViT-based script ID, and PARSeq recognition, complemented by synthetic data from SynthText. The authors provide extensive benchmarks against open-source and commercial baselines, analyze error modes, and demonstrate that BSTD poses unique challenges due to script diversity and data gaps in low-resource languages. They also release an open-source toolkit to promote reproducible research and future expansion of the dataset to broaden linguistic coverage and task capabilities. Overall, BSTD constitutes a significant step toward robust multilingual Indian scene text understanding and sets a practical foundation for future model development and community engagement.

Abstract

Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.

Paper Structure

This paper contains 24 sections, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Scene Text Understanding: Case of India. This figure showcases typical street scenes from northern, eastern, western, and southern parts of India, highlighting the country's vast linguistic diversity. With 11 Indian languages and English commonly featured on signboards (five are shown here: Hindi, English, Bengali, Gujarati, and Tamil), scene text understanding in India presents unique challenges. While significant progress has been made in English scene text recognition, comprehensive open-source efforts for Indian language scene text understanding, including large public datasets and models, are still limited. Our work seeks to address this gap and advance the field.
  • Figure 2: Language map of India (Source: https://commons.wikimedia.org/wiki/File:Language_region_maps_of_India.svg). BSTD covers scene text images in 11 prominent languages, namely Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu. Additionally, it contains English, as English often appears alongside these languages on signboards in India.
  • Figure 3: Pipeline for constructing a multilingual scene-text dataset from Wikimedia Commons images. The process begins with query construction using combinations of Indian city names and place-related keywords to retrieve relevant images via the Wikimedia Commons API. A two-stage filtering process follows: DBnet (pre-trained on Indic synthetic data) removes non-textual images, and the remaining images are manually filtered for relevance. Scene-text detection annotations are generated using DBnet and refined through manual corrections. Finally, cropped word images are recognized using the PARSeq model (trained on 11 Indic languages using synthetic data), with recognition errors corrected and script language tags added to produce the final annotated dataset.
  • Figure 4: An example image from BSTD along with the corresponding JSON annotation.
  • Figure 5: Box plot highlighting the word length range across all the 12 languages.
  • ...and 8 more figures
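
The query-construction step described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the standard MediaWiki search endpoint on Wikimedia Commons; the city names, keywords, and helper functions below are hypothetical placeholders, not the authors' actual lists or code. The sketch only builds request URLs and does not contact the API.

```python
from itertools import product
from urllib.parse import urlencode

# Public MediaWiki API endpoint for Wikimedia Commons.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"

# Hypothetical examples; the paper's actual city and keyword lists are not given here.
CITIES = ["Jodhpur", "Kolkata", "Chennai"]
PLACE_KEYWORDS = ["railway station", "market", "signboard"]


def build_queries(cities, keywords):
    """Combine every city with every place-related keyword."""
    return [f"{city} {keyword}" for city, keyword in product(cities, keywords)]


def search_url(query, limit=50):
    """Build a Commons search URL restricted to the File: namespace (6)."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": query,
        "srnamespace": 6,
        "srlimit": limit,
    }
    return f"{COMMONS_API}?{urlencode(params)}"


queries = build_queries(CITIES, PLACE_KEYWORDS)
urls = [search_url(q) for q in queries]
```

Each URL in `urls` could then be fetched with any HTTP client; the returned file titles would feed the filtering and annotation stages described in the caption.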